Tag: Unix

Unix  Code more shell scripts (Part II)

With NGS data, we have to handle very (very, very) large datasets. Whatever the coding language used (PERL, python, C…), the open/read part of a program is (always) very slow. Thus, debugging can really be a pain due to the time spent reading the file. Using basic unix tools is one approach to cut / merge / debug / analyze the content of these files…

Wanna split?

One column contains data with the format XXXX_YYYY and you want XXXX and YYYY in two different columns? Easy, use (again) the tr command:

tr '_' '\t' < file > new_file

Faster than a PERL program!
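For instance, on a hypothetical file where the first column holds IDs like chr1_12345:

> cat file
chr1_12345      geneA
chr2_67890      geneB
> tr '_' '\t' < file
chr1    12345   geneA
chr2    67890   geneB

Keep in mind that tr works on the whole line: every underscore in the file will be translated, not only those in the target column.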

What is my header content?

Somebody has sent you a large file with many (many) columns and you want to know what the header contains (to extract the good columns with the cut command, for instance)? Whatever the separator used (tab, comma, etc.), reading and counting columns on screen is a pain… First, extract the first line with:

head -1 file > header_content

Then, transform your column separator into a newline character (here for a TSV file):

tr '\t' '\n' < header_content > list_header

Finally, to display the header, type:

cat -n list_header

(the -n option will display the line numbers, which makes picking columns for the cut command easier)
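Of course, the three steps can be piped into a single command:

head -1 file | tr '\t' '\n' | cat -n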

How many [XXXX] in my file?

Let’s end with a short introduction to the grep command (I will prepare a more substantial post dedicated to grep next time). The grep command is powerful, no doubt about that, but not so easy the first time! Basically, the grep syntax is:

grep [options] my_pattern file

One useful option is the -w parameter (which means ‘word’). Thus, if you only want to grep a specific word (e.g. chr and not chromosome), you should use this parameter.
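For instance, on a hypothetical file containing both terms:

grep chr file       # matches lines containing chr, including chromosome
grep -w chr file    # matches only lines where chr appears as a whole word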

Unix  Code more shell scripts (Part I)

I’ve noticed that I don’t code a lot using the shell language: very few sh scripts (to run multiple commands), only basic commands. It’s a mistake since, for file management (for instance), shell scripts are more powerful than any program you could write in a « higher » language. So, let’s start with some useful commands…

Columns work…

Use the ‘cut’ command to extract the 1st and the 3rd columns:

cut -f 1,3 file > new_file

I’d encourage you to type ‘man cut’ to access the whole documentation of this command: there are other parameters, such as ‘-d’, which sets the delimiter between columns.
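For instance, to extract the same columns from a comma-separated file (filename assumed for illustration):

cut -d ',' -f 1,3 file.csv > new_file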

Extract the 1st column and keep the unique elements of this column:

cut -f 1 file | sort | uniq > unique_id

Please notice that you have to sort the data before extracting the unique elements, since the ‘uniq’ command only discards successive identical lines from the input.
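A related trick: the ‘-c’ option of uniq counts the occurrences of each element, which is handy to spot the most frequent IDs:

cut -f 1 file | sort | uniq -c | sort -rn | head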

Remove the d?*$% ^M character (aka the Windows carriage return)

You may have friends or colleagues who are still working with Windows (everybody has a weak point)… But their files are a bit annoying… Indeed, they are full of ‘^M’ with no newline! Your program will go wrong when analysing these files (since your *nix program will search for the ‘\n’ character at the end of each line)… How to remove them? There are plenty of solutions using perl, awk or sed but, again, the simplest (in my opinion) manner is to use the basic unix command ‘tr’…

> cat windows_file
aaa     123^Mbbb        456^Mccc        789^Mddd        123^Meee        456
> tr '\r' '\n' < windows_file > unix_file
> cat unix_file
aaa     123
bbb     456
ccc     789
ddd     123
eee     456

Nice, isn’t it? Of course, the ‘tr’ command wasn’t invented to translate the ^M character of Windows files (!!!). Using the ‘-d’ parameter, you can also delete some characters from a file.
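For files using the full Windows line ending (a ^M followed by a newline, rather than a lone ^M as above), deleting the carriage returns is what you want:

tr -d '\r' < windows_file > unix_file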

Let’s loop…

This is my oldest csh script (from the 20th century!!!), but maybe my most useful one! The principle is quite simple: assume you have several datafiles to analyse using a program; this script will get the number of analyses to run and will launch the main program on each datafile. Here’s the script:

#!/bin/csh
# get the file of filenames
set file = $1
# get number of datafiles
set nl = `wc -l < $file`
set i = 1

# loop on datafiles
while ($i <= $nl)
	set line = `head -$i $file | tail +$i`
	path_to_program $line &
	@ i++
end

echo Done.

The trickiest part is the ‘head-tail’ combination. Let’s explain: this command extracts the first i lines (head) and returns the last line of this extract (tail +$i prints from line i onwards, i.e. only the last line here)… During the first loop, the first line of the file is extracted and ‘tail’ returns that same line. During the second loop, ‘head’ extracts the first 2 lines, ‘tail’ still returns the last one, and so on… At each iteration, ‘$line’ contains the name of a datafile. This kind of script is particularly useful to submit an executable script to a batch server (e.g. using the qsub command on a PBS system).
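For the record, here is a minimal bash sketch of the same loop (path_to_program is the same placeholder as above):

#!/bin/bash
# read the file of filenames line by line and launch the program on each
while read -r line
do
	path_to_program "$line" &
done < "$1"
# wait for all the background jobs before exiting
wait
echo Done.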

Let’s manipulate (large) datafiles

The last case: you have thousands of datafiles from GEO experiments, which contain many lines of gene expression data for many NCBI geneids (1st column). You want to regroup all the gene expression data from many GSE files for each geneid… Typically, you have to read each GSE file, extract the lines corresponding to each geneid and put them into separate files. First, I tried to perform this task with a PERL program (too long). Then, a colleague told me to do it with a shell script and showed me the way (thx)… Here’s the final script:

#!/bin/bash

echo "merging datafiles..."
cat GSE* > dump

echo "get unique GeneIDs..."
listID=`cut -f 1 dump | sort | uniq`

echo "exporting gene expression data to GeneID files..."
for id in $listID
do
	grep "^$id\W" dump > $id
done
rm dump
echo "done."

The most difficult part was to find the right pattern in the grep regex to extract only the lines of the corresponding geneid: the ‘^’ anchors the id at the beginning of the line and the ‘\W’ requires a non-word character (here, the tab) right after it, so that geneid 12 matches neither 123 nor 1234.
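As a side note, GNU awk can do the whole export in a single pass over dump, writing each line to a file named after its first field:

awk '{ print > $1 }' dump    # one output file per geneid (gawk manages the open files itself)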

That’s all for this time. Hope this will help!

PERL  MacOS X 10.6.x, Fink, MacPorts, DBD::mysql and PERL…

Recently, I updated my laptop to Snow Leopard. After installing a bunch of software, I made some tests with PERL… And then, with a PERL program working with a SQL database:

dyld: lazy symbol binding failed: Symbol not found: _mysql_init
  Referenced from: /Library/Perl/5.10.0/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
  Expected in: flat namespace

dyld: Symbol not found: _mysql_init
  Referenced from: /Library/Perl/5.10.0/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
  Expected in: flat namespace

Trace/BPT trap

Argl! According to this perlmonks discussion, there’s a problem with Fink, Snow Leopard and some dynamic libraries… Great! So, I immediately removed Fink from my computer and installed MacPorts. Actually, I didn’t want to install MacPorts, since this package manager always installs a lot of dependencies without checking whether these dependencies are already available on the system (i.e. PERL…). But, as Fink crashed my PERL-SQL setup (and as other package managers don’t have as many packages as MacPorts), I finally installed MacPorts on my laptop. And, again, MacPorts downloaded and installed another PERL distribution on my computer (thanks, you’re welcome)… So, by default, your perl and cpan programs will be the ones in /opt/local/bin…

How to use the Apple PERL distribution? I simply changed the following line in my .profile:

export PATH=/opt/local/bin:/opt/local/sbin:$PATH

by

export PATH=$PATH:/opt/local/bin:/opt/local/sbin

Then, /usr/bin (where the Apple perl program is) comes first in my PATH and the ‘which perl‘ command gives me /usr/bin/perl. Works for me…

Unix  How to mirror a FTP site?

For research purposes, I want to mirror the GEO directory of the NCBI FTP site. These data are huge: more than 1,000 annotation platform files and more than 20,000 GSE files! Platform files are updated each month and GSE files on a daily basis. As you may imagine, I quickly forgot the idea of developing some PERL programs to achieve this task.

I first tried a software package dedicated to the mirroring of scientific databases: BioMaj. With the annotation platform files, BioMaj worked fine: files were quickly downloaded and inserted into SQL databases, with the management of post-process scripts included in BioMaj. But with the GSE files, I think I reached the limits of BioMaj, because of the structure of the GSE datafiles. Indeed, the « seriesmatrix » directory is organized in many subdirectories (almost 20,000, each containing at least one file) and BioMaj spent more than 6 hours just listing the directory content!!! Then, the download and the management of the files (as BioMaj handles file versioning) were quite long, too! More problematic, BioMaj froze / halted / stopped during the update process. After a while, I finally tried another way to make a local mirror: using classic unix tools, such as… lftp, which has mirroring and multi-threading features!

Using lftp is quite simple. I found some useful shell scripts here, which I’ve modified:

#!/bin/bash
HOST="ftp.ncbi.nih.gov"
USER="anonymous"
PASS="anonymous"
LOG_dir="/data/GEO/logs"
LCD="/data/GEO/platforms"
RCD="/pub/geo/DATA/annotation/platforms"
DATE=`date '+%Y-%m-%d_%H%M'`
lftp -c "set ftp:list-options -a;
open ftp://$USER:$PASS@$HOST;
lcd $LCD;
cd $RCD;
mirror --verbose --exclude-glob old/ --parallel=10 --log="$LOG_dir/GPL_$DATE" "

This script also works with the GSE datafiles and it’s relatively fast (about 45 min to get the directory listing and less than one night (!) to download the files).
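To reuse it for the « seriesmatrix » files, only the directories and the log name need to change. The values below are illustrative (check the actual remote layout on the FTP site):

LCD="/data/GEO/seriesmatrix"          # hypothetical local destination
RCD="/pub/geo/DATA/SeriesMatrix"      # remote directory, to be verified on the FTP site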

PERL  Flush file output

If you want to write something to a file (e.g. to a log file), PERL will handle a buffer and won’t write instantly to the file. In most cases, this behaviour is the best (since writing to a file downgrades the performance of your program). But it is not what you expect when you write to a logfile (as your program may stop, leaving the file empty). So, in this case, you should use binmode to give your filehandle a unix layer, which is unbuffered:

open (LOG, ">".$filename) || die "cannot create file\n";
binmode (LOG, ":unix");

Then, your PERL program will write instantly to your logfile…

PERL  UNIX & PERL starter kit…

If you’re looking for a nice introductory lesson to learn the basics of UNIX commands and PERL programming, I recommend getting the PDF file that you will find on the Korf lab website. The lesson is entitled « Unix and Perl Primer for Biologists ». This course is aimed at people « with no prior experience » in either programming or UNIX (as mentioned on the web page). I think it might be useful for a lot of people. Enjoy…

PERL  Control your jobs!

Let’s explain the situation: you have a lot of analyses to run on your SMP computer (the same analysis on different files, many different analyses, or the same analysis many times, for simulation purposes, for instance). And, for different reasons, you cannot use or don’t have access to a cluster with many nodes… First, forget launching jobs one by one, and also forget launching all the jobs in one command (since your server may go down)!

You can perform such tasks by writing csh scripts, which will control your jobs and fill the job queue. Maybe you can also install some queue manager on your server. Looking at the web, I didn’t find a “simple” way to solve this problem, so I wrote a simple PERL script which manages a job queue. This script is based on the Proc::Simple module (available here). The full documentation of this script is available here. Basically, you can choose between 3 analysis modes: many different analyses, the same analysis on different input datafiles, or the same analysis many times. You also have to specify the number of simultaneous jobs to run (i.e. the number of available CPUs) and the script will manage your job queue. I’m not sure it’s the best way, but it works for me.
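If you just want the general idea in shell form, here is a minimal bash sketch of the same principle (this is not my PERL script; the joblist file and the limit of 4 jobs are illustrative):

#!/bin/bash
# usage: ./fill_queue.sh joblist.txt 4
# joblist.txt contains one command per line; 4 = number of simultaneous jobs
N=${2:-4}
while read -r cmd
do
	# wait for a free slot before launching the next job
	while [ `jobs -r | wc -l` -ge "$N" ]
	do
		sleep 5
	done
	eval "$cmd" &
done < "$1"
wait
echo "all jobs done."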

I hope it will be useful and feel free to comment this article.

Unix  *nix command line tips

One thing which is definitely amazing with *nix OSes is that you always learn new things… Let’s have a look at these commands:

  • bc : a nice command-line calculator… To my mind, GUI calculators are boring to use (left click on each number to calculate)… GNU bc provides a calculator that allows you to type in expressions for immediate calculation. It uses the standard conventions for computer arithmetic, i.e. + and - are addition and subtraction, * and / are multiplication and division, ^ is exponentiation. Just type bc in a terminal and use it as follows:
    bc 1.06
    Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
    This is free software with ABSOLUTELY NO WARRANTY.
    For details type `warranty’.
    45*7
    315
    Parentheses can be used to set the order of evaluation (as usual):
    (20/7)^4
    16
    But, as you may have noticed, this result is “wrong”… Just because the number of digits (i.e. the precision) is set to 0 by default. You can change it by using the special scale variable:
    scale=2
    (20/7)^4
    65.97
    scale=5
    (20/7)^4
    66.63863
    Depending on your computation, be careful with the precision of some operations (for instance, exponentiation). Finally, you can type quit to leave bc. You can do a lot of things with bc (it is a kind of interactive programming language); don’t forget to read the man page, just type man bc from the command line. See also the one-liner right after this list.
  • leave : a terminal reminder… Have to leave at 15:00? Need a rest in one hour? Use the leave command! Simply type leave 1500 or leave +0100, respectively… Keep in mind that leave will continue to remind you (to leave) until you close your terminal!
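Back to bc: it also works non-interactively, which is handy in scripts. For instance, to reproduce the last computation above in one line:

echo "scale=5; (20/7)^4" | bc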

Unix  XML::SAX conflict between Debian & CPAN

If you have previously installed XML::SAX via CPAN and you then try to install a Debian package that requires this module, you will probably get this error:

Fatal: cannot call XML::SAX->save_parsers_debian().
You probably have a locally installed XML::SAX module.
See /usr/share/doc/libxml-sax-perl/README.Debian.gz

For further details on this error, you can read this bug report page. The solution is to remove the CPAN-installed XML::SAX module. It can be performed by using this script:

#!/usr/bin/perl -w

use ExtUtils::Packlist;
use ExtUtils::Installed;

$ARGV[0] or die "Usage: $0 Module::Name\n";
my $mod = $ARGV[0];
my $inst = ExtUtils::Installed->new();

foreach my $item (sort($inst->files($mod)))
{
	print "removing $item\n";
	unlink $item;
}

my $packfile = $inst->packlist($mod)->packlist_file();
print "removing $packfile\n";
unlink $packfile;

Then, type the following commands (as root):

chmod u+x rm_perl_mod.pl
./rm_perl_mod.pl XML::SAX
apt-get upgrade

It works for me…

Unix  The beauty of screen

How to launch analyses in the background on a *nix system? You can use « & » at the end of your command, but the screen output of your analyses will be messy and you cannot close your session. You can also use the batch command, but you will lose the screen output and there will be no way to interact with your software. To my mind, the best alternative is the screen command, which is detailed in this article.

To run your favorite CPU-intensive software, use the following command:

screen

Then, you will start with a fresh new shell session, in which you can launch your analysis. Once it’s launched, type « Ctrl-A Ctrl-D » to detach your screen: you will go back to the old shell session that launched the screen session. Now, you can log out.

After a while, to reconnect to your « screen » session, use the command:

screen -list
There are screens on:
3199.pts-1.server (Detached)
3248.pts-1.server (Detached)
2 Sockets in /var/run/screen/S-user

So, to attach to the session on « 3199.pts-1.server », just type:

screen -r 3199.pts-1.server

and that’s it! When your job is done, you can leave your screen session by typing « Ctrl-D »… Don’t forget to read the man page for further details on this « magic » command.
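A last tip: the -S option lets you give a session a name, which is easier to remember than the pts identifier (bigjob is just an example name):

screen -S bigjob
screen -r bigjob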