février 2011

Unix  Code more shell scripts (Part. I)

I’ve noticed, that I didn’t code a lot using shell language: very few sh scripts (for multiple commands executions), only basic commands. It’s a mistake, since for file management (for instance), shell scripts are more powerful than any program you could code using « higher » language. So, let’s start with some usefull commands…

Columns work…

Use the ‘cut’ command to extract the 1st and the 3th columns:

cut -f 1,3 file > new_file

I’d encourage you to type ‘man cut’ to access the whole documentation of this command: there’s another input parameter, as the ‘-d’, which permits to set the delimiter between column.

Extract the 1st column and keep unique element of this column:

cut -f 1 file | sort | uniq > unique_id

Please notice that you have to sort the data before extracting the unique element, since the ‘uniq’ command discards only the  successive identical lines from the input.

Remove the d?*$% ^M character (aka windows-carriage-return)

You may have friends or colleagues which are still working with Windows (everybody has his weak point)… But their files are a bit annoying… Indeed, there are full of ‘^M’ with no carriage return! Your program will go wrong when analysing these files (since your *nix program will search for the ‘\n’ character for the end of the line)… How to remove them? There’s plenty of solutions using perl, awk, sed, but, again, the simpliest (in my opinion) manner  is to use the basic unix command: ‘tr’…

>cat windows_file
aaa     123^Mbbb        456^Mccc        789^Mddd        123^Meee        456
> tr '\r' '\n' < windows_file > unix_file
> cat unix_file
aaa     123
bbb     456
ccc     789
ddd     123
eee     456

Nice, isn’t it? Of course the ‘tr’ command wasn’t invent to translate the ^M character from windows file (!!!). Using the ‘-d’ parameter, you can also delete some characters in a file.

Let’s loop…

This is my oldest csh script (from the 20th century!!!), but maybe my usefull one! The principle is quite simple: assume you have several datafiles to analyse using a program, this script will get the number of analyses to run and will run the main program with each datafiles. Here’s the script:

# get file of filename
 set file = $1
# get number of datafiles
set nl = `wc -l < $file`
set i = 1

# loop on datafiles
while ($i <= $nl)
	set line = `head -$i $file | tail +$i`
	path_to_program $line &
	@ i++

echo Done.

The trickiest part is the ‘head-tail’ combination. Let’s explain: this command extracts the x first line (head) and returns the last line of this extract (tail)… Then, during the first loop, the first line of the file is extracted and the ‘tail’ command extracts the last (i.e. the same, actually). During the second loop, the ‘head’ command extracts the 2 first lines, and the ‘tail’ command still extracts the last, and so on… At each iter, ‘$line’ contains the name of a datafile. This kind of script is particularly usefull to submit an executable script to a batch server (i.e. using the qsub command on PBS system, for instance).

Let’s manipulate (large) datafile

The last case: you have thousands of datafiles of GEO experiments, which contain many lines of gene expression data for many NCBI geneid (1st column). You want to regroup all the gene expression data from many GSE for each geneid… Typically, you have to read each GSE file, extract lines corresponding to each geneid and put them into separate files. First, I tried to perform this task with a PERL program (too long). Then, a colleague told to do it with shell script and show me the way (thx)… Here’s the final script:


echo "merging datafiles..."
cat GSE* > dump

echo "get unique GeneIDs..."
listID=`cat dump | sort | cut -f 1 | uniq`

echo "exporting gene expression data to GeneID files..."
for id in $listID
	grep "^$id\W" dump > $id
rm dump
echo "done."

The most difficult part was to find out the right pattern in the grep regex to only extract the corresponding geneid lines.

That’s all for this time. Hope this will help!