Unix  Code more shell scripts (Part II)

With NGS data, we have to handle very (very, very) large datasets. Whatever the coding language used (Perl, Python, C…), the open/read part of a program is (always) slow, so the debugging process can really be a pain due to the time spent reading the file. Using basic Unix tools is one approach to cut / merge / debug / analyze the content of these files…

Wanna split?

One column contains data in the format XXXX_YYYY and you want to have XXXX and YYYY in two different columns? Easy, use (again) the tr command:

tr '_' '\t' < file > new_file

Faster than a Perl program!

What is my header content?

Somebody has sent you a large file with many (many) columns and you want to know what the header contains (to extract the right columns with the cut command, for instance)? Whatever the separator used (tab, comma, etc.), reading and counting columns on screen is a pain… First, extract the first line with:

head -1 file > header_content

Then, translate your column separator into a newline character (here for a TSV file):

tr '\t' '\n' < header_content > list_header

Finally, to display the header, type:

cat -n list_header

(the -n option displays line numbers, which makes it easier to pick the fields for the cut command)
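The three steps above can also be chained into a single pipeline, with no intermediate files (the file name and contents here are made up for illustration):

```shell
# a small TSV sample
printf 'id\tname\tvalue\n1\tfoo\t42\n' > file.tsv
# print each column header on its own numbered line
head -1 file.tsv | tr '\t' '\n' | cat -n
```

The numbers printed by cat -n map directly to the field numbers you pass to cut -f.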

How many [XXXX] in my file?

Let’s end with a short introduction to the grep command (I will prepare a more substantial post dedicated to grep next time). The ‘grep’ command is powerful, no doubt about that, but not so easy the first time! Basically, the grep syntax is:

grep [options] my_pattern file

One useful option is the -w parameter (match whole words only). Thus, if you only want to ‘grep’ a specific pattern as a word (i.e. chr and not chromosome), you should use this parameter.
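For instance (the file name and contents below are made up for illustration):

```shell
# sample file: one 'chr' line, one 'chromosome' line
printf 'chr\t1\nchromosome\t2\n' > annotations.txt
# -w matches 'chr' only as a whole word, so 'chromosome' is not reported
grep -w 'chr' annotations.txt
```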

Unix  Code more shell scripts (Part. I)

I’ve noticed that I don’t code a lot using the shell language: very few sh scripts (for running multiple commands), only basic commands. That’s a mistake, since for file management (for instance), shell scripts are more powerful than any program you could write in a « higher-level » language. So, let’s start with some useful commands…

Columns work…

Use the ‘cut’ command to extract the 1st and 3rd columns:

cut -f 1,3 file > new_file

I’d encourage you to type ‘man cut’ to access the full documentation of this command: there are other input parameters, such as ‘-d’, which lets you set the delimiter between columns.
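For example, on a comma-separated file (a made-up sample, for illustration):

```shell
# a small CSV sample
printf 'a,b,c\n1,2,3\n' > file.csv
# -d sets the column delimiter (a comma here, instead of the default tab)
cut -d ',' -f 1,3 file.csv > new_file
cat new_file
```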

Extract the 1st column and keep unique element of this column:

cut -f 1 file | sort | uniq > unique_id

Please note that you have to sort the data before extracting the unique elements, since the ‘uniq’ command only discards successive identical lines from its input.
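Note that modern sort implementations can do the sort and the deduplication in one step with the -u flag, which is equivalent here (sample data made up for illustration):

```shell
# sample two-column file with a duplicated id in column 1
printf 'idA\tx\nidB\ty\nidA\tz\n' > file
# 'sort -u' sorts and keeps each value only once
cut -f 1 file | sort -u > unique_id
cat unique_id
```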

Remove the d?*$% ^M character (aka windows-carriage-return)

You may have friends or colleagues who are still working with Windows (everybody has their weak point)… But their files are a bit annoying… Indeed, they are full of ‘^M’ with no carriage return! Your program will go wrong when analysing these files (since your *nix program will look for the ‘\n’ character at the end of each line)… How to remove them? There are plenty of solutions using perl, awk or sed, but, again, the simplest (in my opinion) way is to use the basic Unix command ‘tr’…

> cat windows_file
aaa     123^Mbbb        456^Mccc        789^Mddd        123^Meee        456
> tr '\r' '\n' < windows_file > unix_file
> cat unix_file
aaa     123
bbb     456
ccc     789
ddd     123
eee     456

Nice, isn’t it? Of course, the ‘tr’ command wasn’t invented to translate the ^M character from Windows files (!!!). Using the ‘-d’ parameter, you can also delete some characters from a file.
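For instance, on files that use the full Windows line ending (‘\r\n’), you delete the carriage returns instead of translating them (sample file built inline for illustration):

```shell
# build a small CRLF ("Windows") sample file
printf 'aaa\t123\r\nbbb\t456\r\n' > windows_file
# -d deletes every '\r' instead of translating it
tr -d '\r' < windows_file > unix_file
cat unix_file
```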

Let’s loop…

This is my oldest csh script (from the 20th century!!!), but maybe my most useful one! The principle is quite simple: assume you have several datafiles to analyse with a program; this script gets the number of analyses to run and runs the main program on each datafile. Here’s the script:

# get the file of filenames
set file = $1
# get the number of datafiles
set nl = `wc -l < $file`
set i = 1

# loop over the datafiles
while ($i <= $nl)
	set line = `head -$i $file | tail -1`
	path_to_program $line &
	@ i++
end

echo Done.

The trickiest part is the ‘head-tail’ combination. Let me explain: this command extracts the first x lines (head) and returns the last line of this extract (tail)… So, during the first iteration, the first line of the file is extracted and the ‘tail’ command returns the last one (i.e. the same, actually). During the second iteration, the ‘head’ command extracts the first 2 lines, the ‘tail’ command still returns the last, and so on… At each iteration, ‘$line’ contains the name of a datafile. This kind of script is particularly useful to submit an executable script to a batch server (e.g. using the qsub command on a PBS system).
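If you prefer plain sh over csh, a while-read loop avoids the head/tail trick entirely. A minimal sketch (‘filelist’ is a made-up name, and the echo stands in for your real program):

```shell
# build a small file of filenames for the demo
printf 'datafile1\ndatafile2\n' > filelist
# read it line by line; each $line holds one datafile name
while read -r line; do
    echo "launching analysis on $line"   # replace with: path_to_program "$line" &
done < filelist
echo "Done."
```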

Let’s manipulate (large) datafile

The last case: you have thousands of datafiles from GEO experiments, which contain many lines of gene expression data for many NCBI geneids (1st column). You want to regroup the gene expression data from many GSE files for each geneid… Typically, you have to read each GSE file, extract the lines corresponding to each geneid and put them into separate files. First, I tried to perform this task with a Perl program (too long). Then, a colleague told me to do it with a shell script and showed me the way (thx)… Here’s the final script:


echo "merging datafiles..."
cat GSE* > dump

echo "get unique GeneIDs..."
listID=`cut -f 1 dump | sort | uniq`

echo "exporting gene expression data to GeneID files..."
for id in $listID
do
	grep "^$id\W" dump > $id
done
rm dump
echo "done."

The most difficult part was finding the right pattern in the grep regex to extract only the lines of the corresponding geneid.
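To see why the pattern matters: anchoring with ^ alone is not enough, because geneid 12 would also match geneid 123 at the start of a line. The \W (« non-word character », a GNU grep extension) closes the field (tiny made-up sample below):

```shell
# two geneids sharing a prefix, with made-up expression values
printf '12\t5.2\n123\t7.1\n' > dump
# '^12' alone would match both lines; '^12\W' matches only geneid 12
grep '^12\W' dump
```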

That’s all for this time. Hope this will help!

Unix  How to mirror an FTP site?

For research purposes, I want to mirror the GEO directory of the NCBI FTP site. These data are huge: more than 1,000 annotation platform files and more than 20,000 GSE files! Platform files are updated each month and GSE files on a daily basis. As you may imagine, I quickly dropped the idea of developing some Perl programs to achieve this task.

I first tried to use a software package dedicated to the mirroring of scientific databases: BioMaj. With the annotation platform files, BioMaj works fine: files were quickly downloaded and inserted into SQL databases thanks to the post-process scripts managed by BioMaj. But with the GSE files, I think I reached the limits of BioMaj, because of the structure of the GSE datafiles. Indeed, the « seriesmatrix » directory is organized in many subdirectories (almost 20,000, each containing at least one file) and BioMaj spent more than 6 hours just listing the directory content!!! Then, the downloading and the management of the files (as BioMaj handles file versioning) were quite long, too! More problematic, BioMaj froze / halted / stopped during the update process. After a while, I finally tried another way to make a local mirror: using classic Unix tools, such as… lftp, which has mirroring and multi-threading features!

Using lftp is quite simple. I found some useful shell scripts here, which I’ve modified:

# $USER, $PASS and $HOST hold the FTP credentials and server name;
# $LCD and $RCD are the local and remote directories; $LOG_dir is the log directory
DATE=`date '+%Y-%m-%d_%H%M'`
lftp -c "set ftp:list-options -a;
open ftp://$USER:$PASS@$HOST;
lcd $LCD;
cd $RCD;
mirror --verbose --exclude-glob old/ --parallel=10 --log=$LOG_dir/GPL_$DATE"

This script works with the GSE datafiles and it’s relatively fast: about 45 min to get the directory listing and less than one night (!) to download the files.

Unix  *nix command line tips

One thing which is definitely amazing with *nix OSes is that you always learn new things… Let’s have a look at these commands:

  • bc : a nice command-line calculator… To my mind, I’ve always considered GUI calculators boring to use (left click on each number to calculate)… GNU bc provides a calculator that allows you to type in expressions for immediate evaluation. It uses the standard conventions for computer arithmetic, i.e. + and - are addition and subtraction, * and / are multiplication and division, ^ is exponentiation, and parentheses can be used to set the order of evaluation (as usual). Just type bc in a terminal and use it as follows:
    bc 1.06
    Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
    This is free software with ABSOLUTELY NO WARRANTY.
    For details type `warranty'.
    As you may notice, some results look “wrong”: for instance, 10/3 returns 3. That is just because the number of digits after the decimal point (i.e. the precision) is set to 0 by default. You can change it by assigning the special scale variable (e.g. scale=4). Depending on your arithmetic computation, be careful with the precision for some operations (exponentiation, for instance). Finally, you can type quit to leave bc. You can do a lot of things with bc (it is actually a kind of interactive programming language), so don’t forget to read the man page: just type man bc from the command line.
  • leave : a terminal reminder… Have to leave at 3 pm? Need a rest in 1 hour? Use the leave command! Simply type leave 1500 or leave +0100, respectively… Keep in mind that leave will keep reminding you (to leave) until you close your terminal!

Unix  XML::SAX conflict between Debian & CPAN

If you have made a previous install of XML::SAX via CPAN and then try to install a Debian package that requires this module, you will probably get this error:

Fatal: cannot call XML::SAX->save_parsers_debian().
You probably have a locally installed XML::SAX module.
See /usr/share/doc/libxml-sax-perl/README.Debian.gz

For further details on this error, you can read this bug report page. The solution is to remove the XML::SAX module installed from CPAN. It can be done with this script:

#!/usr/bin/perl -w

use ExtUtils::Packlist;
use ExtUtils::Installed;

$ARGV[0] or die "Usage: $0 Module::Name\n";
my $mod = $ARGV[0];
my $inst = ExtUtils::Installed->new();

# remove every file installed by the module
foreach my $item (sort($inst->files($mod))) {
	print "removing $item\n";
	unlink $item;
}

# then remove the packlist file itself
my $packfile = $inst->packlist($mod)->packlist_file();
print "removing $packfile\n";
unlink $packfile;

Then, type the following commands (as root):

chmod u+x rm_perl_mod.pl
./rm_perl_mod.pl XML::SAX
apt-get upgrade

It works for me…

Unix  The beauty of screen

How to launch analyses in the background on a *nix system? You can add « & » at the end of your command, but the screen output of your analyses will be messy and you cannot close your session. You can also use the batch command, but you will lose the screen output and there will be no way to interact with your software. To my mind, the best alternative is the screen command, which is detailed in this article.

To use your favorite cpu-intensive software, first start a screen session with the following command:

screen
Then, you will start with a fresh new shell session, in which you can launch your analysis. Once it is launched, type « Ctrl-A Ctrl-D » to detach your screen: you will go back to the old shell session from which you launched screen. Now, you can log out.

After a while, to reconnect to your « screen » session, first list the running sessions with the command:

screen -list
There are screens on:
3199.pts-1.server (Detached)
3248.pts-1.server (Detached)
2 Sockets in /var/run/screen/S-user

So, to attach to the session on « 3199.pts-1.server », just type:

screen -r 3199.pts-1.server

and that’s it! When your job is done, you can leave your screen session by typing « Ctrl-D »… Don’t forget to read the man page for further details on this « magic » command.