Unix  How to mirror an FTP site?

For research purposes, I want to mirror the GEO directory of the NCBI FTP site. These data are huge: more than 1,000 annotation platform files and more than 20,000 GSE files! Platform files are updated each month and GSE files on a daily basis. As you may imagine, I quickly dropped the idea of developing some PERL programs to achieve this task.

I first tried a piece of software dedicated to mirroring scientific databases: BioMaj. With the annotation platform files, BioMaj worked fine: files were quickly downloaded and inserted into SQL databases, with the post-process scripts managed by BioMaj itself. But with the GSE files, I think I reached the limits of BioMaj, because of the structure of the GSE data files. Indeed, the « seriesmatrix » directory is organized in many subdirectories (almost 20,000, each containing at least one file), and BioMaj spent more than 6 hours just listing the directory content! The downloading and the management of the files (as BioMaj handles file versioning) were quite long, too! More problematic, BioMaj froze during the update process. After a while, I finally tried another way to build a local mirror: a classic unix tool, lftp, which has both mirroring and multi-threading features!

Using lftp is quite simple. I found some useful shell scripts here, which I’ve modified:

#!/bin/bash
# Mirror the GEO annotation platforms directory with lftp.
HOST="ftp.ncbi.nih.gov"
USER="anonymous"
PASS="anonymous"
LOG_dir="/data/GEO/logs"                   # where the transfer logs go
LCD="/data/GEO/platforms"                  # local directory
RCD="/pub/geo/DATA/annotation/platforms"   # remote directory
DATE=$(date '+%Y-%m-%d_%H%M')
lftp -c "set ftp:list-options -a;
open ftp://$USER:$PASS@$HOST;
lcd $LCD;
cd $RCD;
mirror --verbose --exclude-glob old/ --parallel=10 --log="$LOG_dir/GPL_$DATE" "

This script also works with the GSE data files, and it’s relatively fast: about 45 min to get the directory listing and less than one night (!) to download the files.
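For the GSE « seriesmatrix » mirror, only the local and remote paths change. A hypothetical variant of the two path variables (the remote path is an assumption on my part, check the current layout of the GEO FTP site):

LCD="/data/GEO/seriesmatrix"        # local directory for the GSE files
RCD="/pub/geo/DATA/SeriesMatrix"    # assumed remote location of the series matrix files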

PERL  Who’s the boss?

Do you remember the old post where I compared the performance of different PERL modules for basic statistics? Back then, the winner was PDL. But this time, for particular reasons, I’ve experimented with putting some C code into a PERL program using the Inline::C module. Again, the task is quite simple: computing the Pearson correlation coefficient. The benchmark results are quite… impressive:

Benchmark: running C, bas, pdl for at least 5 CPU seconds...
         C:  5 wallclock secs ( 5.25 usr +  0.01 sys =  5.26 CPU) @ 99364.26/s (n=522656)
       bas:  6 wallclock secs ( 5.24 usr +  0.01 sys =  5.25 CPU) @ 3256.19/s (n=17095)
       pdl:  5 wallclock secs ( 5.30 usr +  0.01 sys =  5.31 CPU) @ 3855.93/s (n=20475)
       Rate   bas   pdl     C
bas  3256/s    --  -16%  -97%
pdl  3856/s   18%    --  -96%
C   99364/s 2952% 2477%    --

Ouch! Say no more… The source code can be found here (be polite, this is my first C code inside a PERL program ;-) )…
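To give you the flavour of the approach, here is a minimal sketch (not the original code linked above; the function name and the toy data are mine): a Pearson coefficient computed by a C function embedded with Inline::C.

#!/usr/bin/perl
use strict;
use warnings;

# Compile a small C function at run time with Inline::C.
use Inline C => <<'END_C';
#include <math.h>
/* Pearson correlation between two Perl arrays (passed by reference). */
double pearson(AV* x, AV* y) {
    int n = av_len(x) + 1;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    int i;
    for (i = 0; i < n; i++) {
        double xi = SvNV(*av_fetch(x, i, 0));
        double yi = SvNV(*av_fetch(y, i, 0));
        sx += xi; sy += yi;
        sxx += xi*xi; syy += yi*yi; sxy += xi*yi;
    }
    return (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
}
END_C

my @a = (1, 2, 3, 4, 5);
my @b = (2, 4, 5, 4, 5);
printf "r = %.4f\n", pearson(\@a, \@b);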

PERL  Flush file output

If you want to write something to a file (e.g. to a log file), PERL will buffer the output and won’t write to the file instantly. In most cases, this behaviour is what you want (since writing to the file at every print would degrade the performance of your program). But it is not what you expect from a logfile (your program may stop, leaving the file empty). So, in this case, you can use binmode to give your filehandle the unix layer, which is unbuffered:

open (LOG, ">", $filename) || die "cannot create file: $!\n";
binmode (LOG, ":unix");    # the :unix layer performs unbuffered I/O

Then, your PERL program will write instantly to your logfile…
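An alternative, if you only need the buffer flushed after each print rather than truly unbuffered I/O, is to enable autoflush on the handle (a minimal sketch, with a lexical filehandle):

use IO::Handle;
open (my $log, ">", $filename) || die "cannot create file: $!\n";
$log->autoflush(1);       # flush the PERL buffer after every print
print $log "started\n";   # reaches the file immediately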

PERL  Biomart and error 500…

One great thing with Biomart is that you can export your query as PERL code. You set up your query once, and then you can launch your PERL script to update your databases, for example (after installing the Biomart API, which is not the easy part, especially with the registry files). It sounds great but, in practice, it doesn’t work well at all for big queries: bandwidth is very, very slow and, most of the time, you will get an error 500 read timeout… So you re-launch your script, and again, and again… After a while, you will get upset with Biomart, trust me!

So I searched the Biomart help and found the MartService. As the help is « missing », I tried the example on the website. To my mind, it is neither particularly clear (what does the POST example mean?) nor working (the wget command returned an error). So I tried different things. I first picked up the XML file (XML button, top right). By the way, the « Unique Results only » option didn’t work in the XML file: whether the option was selected or not, the XML file still contained uniqueRows = '0' (don’t forget to change it to '1', or your files will be very, very big). After hanging around the website for a while, I copied/pasted the content of the file WebExample.pl:

# an example script demonstrating the use of the BioMart webservice
use strict;
use warnings;
use LWP::UserAgent;

# The query is read from the XML file given on the command line.
open (FH, $ARGV[0]) || die ("\nUsage: perl webExample.pl Query.xml\n\n");

my $xml = '';
while (<FH>){
    $xml .= $_;
}
close(FH);

# POST the XML query to the MartService endpoint.
my $path="http://www.biomart.org/biomart/martservice?";
my $request = HTTP::Request->new("POST",$path,HTTP::Headers->new(),'query='.$xml."\n");
my $ua = LWP::UserAgent->new;

# Stream the response: the callback is called for each chunk received.
$ua->request($request,
	     sub{
		 my($data, $response) = @_;
		 if ($response->is_success) {
		     print "$data";
		 }
		 else {
		     warn ("Problems with the web server: ".$response->status_line);
		 }
	     },1000);

Then, I used the following command (as indicated in the example):

perl WebExample.pl myfile.xml
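For reference, the query file is just the XML exported from the website. A hypothetical minimal example (the dataset, filter and attribute names are placeholders, use your own exported XML), with uniqueRows changed to '1' as discussed above:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1" count="">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="chromosome_name" value="1"/>
    <Attribute name="ensembl_gene_id"/>
  </Dataset>
</Query>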

… And… it worked: data were flowing in my terminal! Wunderbar! Finally, don’t bother with the installation of the Biomart API and the configuration of the registry files (not tricky at all ;-) ) if you just want to update your data automatically: use the XML approach with the LWP script. It’s easier, faster, and you won’t get the error 500 read timeout.

PERL  Don’t dereference if you don’t need…

Sometimes, the PERL syntax can be… obscure, especially if you’re dealing with a Hash of references to Lists… Let’s imagine you’re handling the tree structure of a family, for instance. You will have a Hash where the keys are the parents. For each parent, the Hash stores a reference to a List holding the children. Your data file has this structure:

parent1 child1
parent1 child2
child2 child3
...

Then, while reading the file content, you will use the following code:

my @T = split(/\t/, $line);
my $parent = $T[0];
my $child = $T[1];
push (@{$Family{$parent}}, $child);    # append the child to the parent's list

Then, to find all the descendants of one parent, you can use a recursive function:

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	# Work through the references directly: no copy of %Family is made.
	foreach my $child (@{$$_Family{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}
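A hypothetical driver tying everything together (the file name and the parent ID are made up):

my (%Family, %List);
open (my $fh, "<", "pedigree.txt") || die "cannot open file: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    my @T = split(/\t/, $line);
    push (@{$Family{$T[0]}}, $T[1]);
}
close($fh);

get_children ("parent1", "parent1", \%Family, \%List);
print scalar(@{$List{"parent1"}}), " descendants found\n";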

But don’t write this next version of the function (even if the syntax looks simpler):

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	my %Temp = %$_Family;    # this line copies the whole Hash at every call!
	foreach my $child (@{$Temp{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}

In this last version, the line my %Temp = %$_Family dereferences the Hash reference into a new Hash. If you’re dealing with a small pedigree, it won’t affect the performance of the script. But if you’re handling (very) large families, performance will collapse, since at each recursive step the whole pedigree Hash is copied into a new Hash (meaning memory transfers)…

Let’s benchmark the two versions with a large pedigree of 33,000 individuals (stored in %Family), reconstructing the list of all the descendants of a parent with an offspring of 600 individuals (the kind of family we have to manage in animal genetics):

Pointer version

Time taken was  0 wallclock secs ( 0.00 usr  0.00 sys +  0.00 cusr  0.00 csys =  0.00 CPU) seconds

Dereference version

Time taken was 27 wallclock secs (26.08 usr  0.09 sys +  0.00 cusr  0.00 csys = 26.17 CPU) seconds
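For the record, the timings above come from something like this sketch with the Benchmark module (one timed call per version; the parent ID is made up):

use Benchmark;

my %List;
my $t = timeit(1, sub {
    get_children ("parent1", "parent1", \%Family, \%List);
});
print "Time taken was ", timestr($t), " seconds\n";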


Need more comments?

Blablabla  Winner!

Just a short post to let you know that our PhD students won two awards at the Rennes festival of (very) short films for popular science (jury award and audience award). If you understand French, have a look at this video dealing with the concept of QTL and linkage analysis. I’m very proud of them! Nice work, folks! Congratulations to Yuna, Marion, Xiao and Yvan!


La poule et la truite

C/C++  C and Mac

Another programming tutorial, aimed at people who want to start learning the C language with MacOS X and Xcode. It will help you set up your Xcode environment correctly and get started with C programming. Again, no prior knowledge of programming is required.

The masters of the void

PERL  UNIX & PERL starter kit…

If you’re looking for a nice introductory lesson on the basics of UNIX commands and PERL programming, I recommend the PDF file that you will find on the Korf lab website. The lesson is entitled « Unix and Perl Primer for Biologists ». The course is aimed at people « with no prior experience » in either programming or UNIX (as mentioned on the web page). I think it might be useful for a lot of people. Enjoy…

PERL  Basic stats with PERL

As mentioned in a previous post, I often need to compute simple and basic statistics on my datasets. In another previous post, I benchmarked several PERL modules for simple statistics (so now you know which modules to use in your scripts). Finally, to characterize my datasets, I also perform some basic statistical tests, such as the t test. Initially, I wrote the values to files, loaded them into R and performed a t test using the t.test() function. But if you have to perform many tests, this method quickly becomes tedious, so I set out to find a PERL module that does the same thing. One thing I really love about PERL and CPAN is that you will always find a module for your needs! And guess what? I found the Statistics::TTest module…

And the second great thing is… that this module is really simple to use, just type:

use Statistics::TTest;
my @Data1 = (...);  # your first array of values
my @Data2 = (...);  # your second array of values
my $ttest = new Statistics::TTest;
$ttest->set_significance(95);
$ttest->load_data(\@Data1, \@Data2);
$ttest->output_t_test();

It will perform a t test between your two datasets (in @Data1 and @Data2) and you will get this kind of output on screen:

*****************************************************

Summary  from the observed values of the sample 1:
	sample size= 547 , degree of freedom=546
	mean=0.0475802079237217 , variance=0.353389091675633
	standard deviation=0.594465383075948 , standard error=0.0254175043571037
	 the estimate of the mean is 0.0475802079237217 +/- 0.0499276038086589
	 or (-0.00234739588493724 to 0.0975078117323805 ) with 95 % of confidence
	 t-statistic=T=1.87194648440864 , Prob >|T|=0.0617439999999999
*****************************************************

Summary  from the observed values of the sample 2:
	sample size= 17 , degree of freedom=16
	mean=0.309487901019442 , variance=0.370557765440336
	standard deviation=0.608734560740834 , standard error=0.147639817170496
	 the estimate of the mean is 0.309487901019442 +/- 0.312981648419734
	 or (-0.00349374740029174 to 0.622469549439176 ) with 95 % of confidence
	 t-statistic=T=2.09623600835297 , Prob >|T|=0.052318
*****************************************************

Comparison of these 2 independent samples.
	 F-statistic=1.04858291941977 , cutoff F-statistic=1.8274 with alpha level=0.05 and  df =(16,546)
	equal variance assumption is accepted(not rejected) since F-statistic < cutoff F-statistic
        degree of freedom=562 , t-statistic=T=1.78772255292605 Prob >|T|=0.07436
	the null hypothesis (the 2 samples have the same mean) is not rejected since the alpha level is 0.05
	difference of the mean=-0.26190769309572, standard error=0.146503545903722
	 the estimate of the difference of the mean is -0.26190769309572 +/- 0.287762264864091
	 or (-0.549669957959811 to 0.0258545717683704 ) with 95 % of confidence

If you are a bit confused by the amount of data printed on screen, you can also directly access specific values through dedicated methods. For instance, if you wish to print only the t value, df and p-value, you can use the following methods (more are available in the CPAN man pages):

use Statistics::TTest;
my @Data1 = (...);  # your first array of values
my @Data2 = (...);  # your second array of values
my $ttest = new Statistics::TTest;
$ttest->set_significance(95);
$ttest->load_data(\@Data1, \@Data2);
my $t    = $ttest->t_statistic;
my $df   = $ttest->df;
my $prob = $ttest->{t_prob};
my $test = $ttest->null_hypothesis();
print "t=$t (df = $df) - p-value = $prob\nNull hypothesis is $test\n";

That will give you such output:

t=1.78772255292605 (df = 562) - p-value = 0.07436
Null hypothesis is not rejected

Okay, computing t tests in PERL is nice, but someone could argue that the built-in function in R is quite good, too… and that it’s not so difficult to export data to files… Yep, you’re right… But this module provides one (really) nice feature: it performs an F test to check the equal variance assumption before computing the t test (you know, the var.equal flag set to FALSE by default in R). The calculation method then differs according to the result of the F test…

So, finally, this module is fast enough, easy to implement and doesn’t require exporting data from PERL scripts.

Blablabla  New theme

Don’t worry, you’re still on the iChicken website! Just a simple theme switch. The initial theme had been quickly chosen during the install and, in the end, some (minor) points of the old theme didn’t fit the content of some posts very well. Moreover, I prefer a light and clear theme. I hope it will please you, too.