Unix  Code more shell scripts (Part. II)

With NGS data, we have to handle very (very very) large datasets. Whatever the coding language used (PERL, python, C…), the open/read part of a program is (always) very slow. Thus, the debugging process can really be a pain due to the time needed to read the file. Using the basic unix tools is one approach to cut / merge / debug / analyze the content of the files…

Wanna split?

One column contains data with the format XXXX_YYYY and you want to have XXXX and YYYY in two different columns? Easy, use (again) the tr command:

tr '_' '\t' < file > new_file
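
For example, with a hypothetical two-line file (mind that tr translates every underscore in the file, so make sure no other column contains one):

> cat file
chr1_12345
chr2_67890
> tr '_' '\t' < file
chr1	12345
chr2	67890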

Faster than a PERL program!

What is my header content?

Somebody has sent you a large file with many (many) columns and you want to know what the header contains (to extract the good columns with the cut command, for instance)? Whatever the separator used (tab, comma, etc.), reading and counting columns on screen is a pain… First, extract the first line with:

head -1 file > header_content

Then, transform your column separator into a newline character (here for a TSV file):

tr '\t' '\n' < header_content > list_header

Finally, to display the header, type:

cat -n list_header

(the -n option displays the line numbers, which makes things easier for the cut command)
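
Note that the three steps can be chained in a single pipeline, without any intermediate file:

head -1 file | tr '\t' '\n' | cat -n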

How many [XXXX] in my file?

Let’s end with a short introduction to the grep command (I will prepare a more substantial post dedicated to grep next time). The ‘grep’ command is powerful, no doubt about that, but not so easy the first time! Basically, the grep syntax is:

grep [options] my_pattern file

One useful option is the -w parameter (which means ‘word’). Thus, if you only want to ‘grep’ a specific word (e.g. chr and not chromosome), you should use this parameter.
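
And, to answer the question of this section, add the -c option, which counts the matching lines instead of printing them:

grep -w 'chr' file     # prints the lines containing the word 'chr' (and not 'chromosome')
grep -cw 'chr' file    # only counts them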

software  DGD accepted for publication in PLOS ONE

The paper describing the construction and the content of this database has been accepted for publication and the article is now available on the PLOS ONE website. We improved the first release by adding data from the GEO and GO databases. These additions made it possible to analyse the gene coexpression level and the annotation similarity inside each group of duplicated genes.

The database is available here: http://dgd.genouest.org. A contact form is available on the official DGD website, where you can leave a comment or ask for new species.

software  DGD website released!

The Duplicated Gene Database website has been released! This database provides a simple way to quickly and easily select groups of tandem duplicates or large groups of multigene families by gene identifier, chromosomal location and/or keywords. This database could be useful for various fields of application (structural genomics, gene expression profiling, gene mapping, gene annotation) and could be expanded to other genomes as soon as genomic annotations and peptide sequences are available.

homepage

http://dgd.genouest.org

Blablabla  So true…

source

software  AnnotQTL accepted for publication in the NAR 2011 webserver edition

NAR has just let us know that our AnnotQTL manuscript has been accepted for publication. The manuscript will be online in 2-3 weeks! In addition to the first release of AnnotQTL, we have added several ‘missing’ features:

  • The assembly version and the date of the request are now inserted into the exported text file (TSV or XML).
  • You can now run multiple analyses of several QTL regions (via a specific form in the ‘submit a request’ section). Of course, in this ‘multiple analyses’ mode, the highlighting feature still works (but for a single biological function, e.g. a reproduction trait for all the QTL regions).

AnnotQTL can be found at http://annotqtl.genouest.org. We have also decided to add new features in the near future (based on the referees’ comments), such as defining a genomic region based on the STS markers surrounding a QTL region. A contact form is available on the official AnnotQTL website, where you can leave a comment or ask for new species. The article is available here.

Unix  Code more shell scripts (Part. I)

I’ve noticed that I don’t code a lot using the shell language: very few sh scripts (to run multiple commands), only basic commands. It’s a mistake, since for file management (for instance), shell scripts are more powerful than any program you could code using a “higher-level” language. So, let’s start with some useful commands…

Columns work…

Use the ‘cut’ command to extract the 1st and the 3rd columns:

cut -f 1,3 file > new_file

I’d encourage you to type ‘man cut’ to access the whole documentation of this command: there are other input parameters, such as ‘-d’, which sets the delimiter between columns.
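
For instance, to extract the same columns from a hypothetical comma-separated file:

cut -d ',' -f 1,3 file.csv > new_file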

Extract the 1st column and keep unique element of this column:

cut -f 1 file | sort | uniq > unique_id

Please notice that you have to sort the data before extracting the unique elements, since the ‘uniq’ command only discards successive identical lines from the input.
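
As a shortcut, the -u option of sort performs the sort and the duplicate removal in a single step:

cut -f 1 file | sort -u > unique_id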

Remove the d?*$% ^M character (aka windows-carriage-return)

You may have friends or colleagues who are still working with Windows (everybody has his weak point)… But their files are a bit annoying… Indeed, they are full of ‘^M’ characters with no line feed! Your program will go wrong when analysing these files (since your *nix program will search for the ‘\n’ character as the end of the line)… How to remove them? There are plenty of solutions using perl, awk or sed but, again, the simplest (in my opinion) manner is to use the basic unix command ‘tr’…

> cat windows_file
aaa     123^Mbbb        456^Mccc        789^Mddd        123^Meee        456
> tr '\r' '\n' < windows_file > unix_file
> cat unix_file
aaa     123
bbb     456
ccc     789
ddd     123
eee     456

Nice, isn’t it? Of course the ‘tr’ command wasn’t invented to translate the ^M character from Windows files (!!!). Using the ‘-d’ parameter, you can also delete some characters from a file.
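
One last note: a true Windows text file ends its lines with ‘\r\n’, and translating the ‘\r’ into ‘\n’ would then double every line break. In that case, deleting the ‘\r’ is the way to go:

tr -d '\r' < windows_file > unix_file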

Let’s loop…

This is my oldest csh script (from the 20th century!!!), but maybe my most useful one! The principle is quite simple: assume you have several datafiles to analyse using a program; this script will get the number of analyses to run and will run the main program on each datafile. Here’s the script:

#!/bin/csh
# get file of filename
set file = $1
# get number of datafiles
set nl = `wc -l < $file`
set i = 1

# loop on datafiles
while ($i <= $nl)
	set line = `head -n $i $file | tail -n 1`
	path_to_program $line &
	@ i++
end

echo Done.

The trickiest part is the ‘head-tail’ combination. Let’s explain: this command extracts the first x lines (head) and returns the last line of this extract (tail)… Thus, during the first loop, the first line of the file is extracted and the ‘tail’ command extracts the last one (i.e. the same, actually). During the second loop, the ‘head’ command extracts the first 2 lines, the ‘tail’ command still extracts the last one, and so on… At each iteration, ‘$line’ contains the name of a datafile. This kind of script is particularly useful to submit an executable script to a batch server (i.e. using the qsub command on a PBS system, for instance).
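
For the record, here is a minimal bash equivalent of this old csh script (path_to_program still stands for your own analysis program):

#!/bin/bash
# loop over the datafiles listed in the file given as first argument
while read -r datafile; do
	path_to_program "$datafile" &
done < "$1"
# wait for all the background jobs to finish
wait
echo "Done."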

Let’s manipulate (large) datafiles

The last case: you have thousands of datafiles from GEO experiments, which contain many lines of gene expression data for many NCBI geneids (1st column). You want to regroup all the gene expression data from many GSE for each geneid… Typically, you have to read each GSE file, extract the lines corresponding to each geneid and put them into separate files. First, I tried to perform this task with a PERL program (too long). Then, a colleague told me to do it with a shell script and showed me the way (thx)… Here’s the final script:

#!/bin/bash

echo "merging datafiles..."
cat GSE* > dump

echo "get unique GeneIDs..."
listID=`cat dump | sort | cut -f 1 | uniq`

echo "exporting gene expression data to GeneID files..."
for id in $listID
do
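	# \W requires a non-word character right after the id, so geneid 12 does not also match geneid 123 (GNU grep)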
	grep "^$id\W" dump > $id
done
rm dump
echo "done."

The most difficult part was to find the right pattern in the grep regex to extract only the lines of the corresponding geneid.
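
For the record, the same regrouping can also be done in a single pass with awk, which can use the content of a field as an output filename. A minimal sketch, assuming the geneids make valid filenames (the close() call avoids hitting the open-file limit, and mind that ‘>>’ appends, so remove the old output files before re-running):

awk -F '\t' '{ print >> $1; close($1) }' GSE*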

That’s all for this time. Hope this will help!

software  AnnotQTL website has been released!

Recently, we have released a new website: AnnotQTL, a web tool designed to gather the functional annotation from different prominent websites while limiting the redundancy of information.

The last steps of genetic mapping research programs require closely analyzing several QTL regions to select candidate genes for further studies. Despite the existence of several websites (NCBI genome browser, Ensembl Browser and UCSC Genome Browser) or web tools (Biomart, Galaxy) to achieve this task, the selection of candidate genes is still laborious. Indeed, the information available on the prominent websites slightly differs in terms of gene prediction and functional annotation, and some other websites provide extra information that researchers may want to use. Merging and comparing this information can be done manually for one QTL containing a few genes but would hardly be possible for many different QTL regions and dozens of genes. Here we propose a web tool that, for the region of interest, merges the lists of genes available in NCBI and Ensembl, removes redundancy, adds the functional annotation from different prominent websites and highlights the genes whose functional annotation fits the biological function or disease of interest. This tool is dedicated to sequenced livestock animal species (cattle, pig, chicken, and horse) and the dog, as they have been extensively studied (indeed, more than 8000 QTL have been detected).

The AnnotQTL server can be found here: http://annotqtl.genouest.org/

Blablabla  Manage your biblio: Papers for Mac

I already had a licence of EndNote X2 on my computer, mostly to insert citations in Word. In my opinion, EndNote is not a good software: it’s slow, heavy and not appropriate to manage your bibliography. Indeed, you cannot create sub-groups in your custom groups, associate keywords, use smart folders (maybe people from Thomson don’t use a Mac?), etc, etc… But it’s the only one that offers a decent integration in Word to insert citations. BTW, you should note that you always have to upgrade to the new version if you update your Microsoft software (nice, isn’t it?). No CWYW update is offered for older versions of EndNote if you buy Office 2011 (and, right now, EndNote X4 still doesn’t support Office 2011, very nice, as I said…).

So, I recently bought a licence for Papers for my MacBook to manage my bibliography. And you can trust me, this program is awesome!!! First, the import of PDF files (great feature) works perfectly fine. If Papers cannot recognize your PDF file, you can identify the article by querying different websites. In many cases, it works. And, if not, you can select a part of your PDF and use your selection to query the databases (and then, it always works!). You have a full-screen mode to read PDFs, you can annotate the articles, create as many groups and sub-groups as you want, create smart folders, search through your collection Spotlight-style, and this program is fast, really. What is missing? In my opinion, an integration with Word. I sent mails to the support, and they told me that this feature is planned (but when?). But this is not the first purpose of this software: it was designed to manage your bibliography, and it achieves this task perfectly! You should definitely give Papers a try. The pricing policy is quite nice: 40% off for students, and only 34 € ($40) for the others (compared to the price of EndNote X4: at least $249, and no student pricing).

PERL  Mac OS X 10.6.x, Fink, MacPorts, DBD::mysql and PERL…

Recently, I updated my laptop to Snow Leopard. After installing a bunch of software, I made some tests with PERL… And then, with a PERL program working with a SQL database:

dyld: lazy symbol binding failed: Symbol not found: _mysql_init
  Referenced from: /Library/Perl/5.10.0/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
  Expected in: flat namespace

dyld: Symbol not found: _mysql_init
  Referenced from: /Library/Perl/5.10.0/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
  Expected in: flat namespace

Trace/BPT trap

Argl! According to this PerlMonks discussion, there’s a problem with Fink, Snow Leopard and some dynamic libraries… Great! So, I immediately removed Fink from my computer and installed MacPorts. Actually, I didn’t want to install MacPorts, since this package manager always installs a lot of dependencies, without checking whether these dependencies are already available on the system (i.e. PERL…). But, as Fink crashed my PERL-SQL setup (and as other package managers don’t have as many packages as MacPorts), I finally installed MacPorts on my laptop. And, again, MacPorts downloaded and installed another PERL distribution on my computer (thanks, you’re welcome)… So, by default, your perl and cpan programs will be the ones in /opt/local/bin…

How to use the Apple PERL distribution? I simply changed the following line in my .profile:

export PATH=/opt/local/bin:/opt/local/sbin:$PATH

by

export PATH=$PATH:/opt/local/bin:/opt/local/sbin

Then, my /usr/bin PATH comes first (where the Apple perl program is) and the ‘which perl‘ command gives me /usr/bin/perl. Works for me…

PERL  multi-threading with PERL

Multi-threading, forking… At first, it seems to be complicated… Actually, it can be quite simple! First, one thing you should know: your PERL installation may not support threads (this option is set at compilation time), so check it out:

perl -V
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.30.5-dsa-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux brahms 2.6.30.5-dsa-amd64 #1 smp mon aug 17 02:18:43 cest 2009 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef

Here, you should look at the following option: useithreads=define. If your PERL install doesn’t have this option, I recommend you to install the forks.pm module. During installation, the CPAN installer will ask you:

It appears your perl was not built with native ithreads.

Would you like to create references to forks, such that
using 'use threads' and 'use threads::shared' will quietly
load forks and forks::shared? [no]

I recommend you to choose yes at this point: this will permit you to develop programs with the use threads directive, whether or not this option has been set during PERL compilation (note: be aware that the forks module is not as fast as the native threads option).
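
As a shortcut, you can also query this compilation flag directly, without reading the whole perl -V output:

perl -V:useithreads
useithreads='define';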

Then, how do we develop these threads? You can find a lot of tutorials with google, like this one, for instance. But there are some issues with these examples: they are all based on loops with a fixed number of iterations (do something 10 times and thread it). Okay, that’s nice for beginners. But, again, in the real world, that’s not how things happen! Most of the time, you have a lot of basic operations to perform (let’s say about 500) and you don’t want to perform all the operations at the same time (as your system may go down). How to develop a system-friendly program that will only launch a limited number of threads and will still perform all the tasks? That’s the point! And, again, trust me, I didn’t find any tutorial for this “real-world” case. Here I propose a program that may help you to achieve this. The algorithm is based on a while loop that compares the number of tasks to perform with the number of running and done threads. It can easily be adapted to any case. Let’s say that you have 100 nucleotide sequences to analyze. These sequences are stored in a hash table. Then, you can get the number of entries in your hash (your $nb_compute) and enter the loop. Now, let’s stop talking and have a look at the code:

#!/opt/local/bin/perl -w
use threads;
use strict;
use warnings;

my @a = ();
my @b = ();

sub sleeping_sub ( $ $ $ );

print "Starting main program\n";

my $nb_process = 10;
my $nb_compute = 20;
my $i=0;
my @running = ();
my @Threads;
while (scalar @Threads < $nb_compute) {
	@running = threads->list(threads::running);
	print "LOOP $i\n";
	print "  - BEGIN LOOP >> NB running threads = ".(scalar @running)."\n";

	if (scalar @running < $nb_process) {
		my $thread = threads->new( sub { sleeping_sub($i, \@a, \@b) });
		push (@Threads, $thread);
		my $tid = $thread->tid;
		print "  - starting thread $tid\n";
	}
	@running = threads->list(threads::running);
	print "  - AFTER STARTING >> NB running Threads = ".(scalar @running)."\n";
	foreach my $thr (@Threads) {
		if ($thr->is_running()) {
			my $tid = $thr->tid;
			print "  - Thread $tid running\n";
		}
		elsif ($thr->is_joinable()) {
			my $tid = $thr->tid;
			$thr->join;
			print "  - Results for thread $tid:\n";
			print "  - Thread $tid has been joined\n";
		}
	}

	@running = threads->list(threads::running);
	print "  - END LOOP >> NB Threads = ".(scalar @running)."\n";
	$i++;
}
print "\nJOINING pending threads\n";
# threads->list() returns the threads not yet joined (running or joinable),
# so this loop only ends when every thread has been joined
while (scalar(threads->list()) != 0) {
	foreach my $thr (@Threads) {
		$thr->join if ($thr->is_joinable());
	}
}

print "NB started threads = ".(scalar @Threads)."\n";
print "End of main program\n";

sub sleeping_sub ( $ $ $ ) {
	sleep(4);
}

During the main loop, the program will start new threads if the number of running threads is lower than the maximum number of threads. Still within this loop, it will join the pending threads. Then, at the end of the loop, you must be aware that some threads may still be running, so another loop will join the last running threads. You should note that the parameters of the sub are not used (it’s just for the example), but you can send parameters to your favorite sub and get the results back, too. To get more details about shared data, I recommend you to read the threads perldoc. I hope it will help.