Tag: code

PERL  Multi-threading with PERL

Multi-threading, forking… At first, it seems complicated… Actually, it can be quite simple! First, one thing you should know: your PERL installation may not support threads (this option is set at compile time), so check it out:

perl -V
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.30.5-dsa-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux brahms 2.6.30.5-dsa-amd64 #1 smp mon aug 17 02:18:43 cest 2009 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef

Here, you should look for the following option: useithreads=define. If your PERL install doesn’t have this option, I recommend installing the forks.pm module. During installation, the CPAN installer will ask you:

It appears your perl was not built with native ithreads.

Would you like to create references to forks, such that
using 'use threads' and 'use threads::shared' will quietly
load forks and forks::shared? [no]

I recommend choosing yes at this point: this will let you develop programs with the use threads directive, however this option was set during PERL compilation (note: be aware that the forks module is not as fast as native threads).
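
Note that, once forks is installed, you can also load it explicitly; according to its documentation, it is a drop-in replacement for the threads API:

use forks;            # instead of "use threads"
use forks::shared;    # instead of "use threads::shared"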

So, how do we write these threads? You can find a lot of tutorials with Google, like this one, for instance. But there are some issues with these examples: they are all based on loops with a fixed number of iterations (do something 10 times and thread it). Okay, that’s nice for beginners. But, again, in the real world, that’s not how things happen! Most of the time, you have a lot of basic operations to perform (let’s say about 500) and you don’t want to run them all at the same time (as your system may go down). How do you develop a system-friendly program that will only launch a limited number of threads and still perform all the tasks? That’s the point! And, again, trust me, I didn’t find any tutorial for this « real-world » case. Here I propose a program that may help you achieve this. The algorithm is based on a while loop that compares the number of tasks to perform with the number of running and finished threads. It can easily be adapted to any case. Let’s say that you have 100 nucleotide sequences to analyze. These sequences are stored in a hash table. Then, you can get the number of entries in your hash (your $nb_compute) and enter the loop. Now, let’s stop talking and have a look at the code:

#!/opt/local/bin/perl -w
use threads;
use strict;
use warnings;

my @a = ();
my @b = ();

sub sleeping_sub ( $ $ $ );

print "Starting main program\n";

my $nb_process = 10;	# maximum number of simultaneous threads
my $nb_compute = 20;	# total number of tasks to perform
my $i = 0;
my @running = ();
my @Threads;
while (scalar @Threads < $nb_compute) {
	@running = threads->list(threads::running);
	print "LOOP $i\n";
	print "  - BEGIN LOOP >> NB running threads = ".(scalar @running)."\n";

	if (scalar @running < $nb_process) {
		my $thread = threads->new( sub { sleeping_sub($i, \@a, \@b) });
		push (@Threads, $thread);
		my $tid = $thread->tid;
		print "  - starting thread $tid\n";
	}
	@running = threads->list(threads::running);
	print "  - AFTER STARTING >> NB running Threads = ".(scalar @running)."\n";
	foreach my $thr (@Threads) {
		if ($thr->is_running()) {
			my $tid = $thr->tid;
			print "  - Thread $tid running\n";
		}
		elsif ($thr->is_joinable()) {
			my $tid = $thr->tid;
			$thr->join;
			print "  - Results for thread $tid:\n";
			print "  - Thread $tid has been joined\n";
		}
	}

	@running = threads->list(threads::running);
	print "  - END LOOP >> NB Threads = ".(scalar @running)."\n";
	$i++;
}
print "\nJOINING pending threads\n";
while (scalar @running != 0) {
	foreach my $thr (@Threads) {
		$thr->join if ($thr->is_joinable());
	}
	@running = threads->list(threads::running);
}

print "NB started threads = ".(scalar @Threads)."\n";
print "End of main program\n";

sub sleeping_sub ( $ $ $ ) {
	# dummy task: the parameters are not used here
	sleep(4);
}

During the main loop, the program starts new threads if the number of running threads is lower than the maximum number of threads. Still within this loop, it joins pending threads. Then, at the end of the loop, you must be aware that some threads may still be running, so another loop joins the last running threads. You should note that the parameters of the sub are not used (it’s just for the example), but you can send parameters to your favorite sub and get the results back, too. To get more details about shared data, I recommend reading the threads perldoc. I hope it will help.
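
For instance, getting results back is just a matter of capturing the return value of join. Here is a minimal sketch, with a hypothetical compute_sub:

my $thread = threads->new( sub { compute_sub(42) } );
# ... later, once the thread is joinable ...
my $result = $thread->join;	# join returns the value returned by the sub
print "  - Thread result = $result\n";

sub compute_sub ( $ ) {
	my ($n) = @_;
	return $n * 2;	# any scalar (or list) can be returned to join
}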

PERL  Who’s the boss?

Do you remember the old post where I compared the performance of different PERL modules for basic statistics? Back then, the winner was PDL. But this time, for particular reasons, I’ve experimented with embedding some C code in a PERL program using the Inline::C module. Again, the principle is quite simple: compute the Pearson correlation coefficient. The benchmark results are quite… impressive:

Benchmark: running C, bas, pdl for at least 5 CPU seconds...
         C:  5 wallclock secs ( 5.25 usr +  0.01 sys =  5.26 CPU) @ 99364.26/s (n=522656)
       bas:  6 wallclock secs ( 5.24 usr +  0.01 sys =  5.25 CPU) @ 3256.19/s (n=17095)
       pdl:  5 wallclock secs ( 5.30 usr +  0.01 sys =  5.31 CPU) @ 3855.93/s (n=20475)
       Rate   bas   pdl     C
bas  3256/s    --  -16%  -97%
pdl  3856/s   18%    --  -96%
C   99364/s 2952% 2477%    --

Ouch! Say no more… The source code can be found here (be polite, this is my first C code in a PERL program ;-) !)…
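
To give you an idea of what such an embedding looks like, here is a minimal sketch of a Pearson coefficient written in C through Inline::C (my illustration, not the exact code linked above):

#!/usr/bin/perl
use strict;
use warnings;

use Inline C => <<'END_C';
#include <math.h>

/* Pearson correlation between two PERL arrays, passed as references */
double pearson(AV *x, AV *y) {
	int n = av_len(x) + 1;
	double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
	int i;
	for (i = 0; i < n; i++) {
		double xi = SvNV(*av_fetch(x, i, 0));
		double yi = SvNV(*av_fetch(y, i, 0));
		sx  += xi;      sy  += yi;
		sxx += xi * xi; syy += yi * yi;
		sxy += xi * yi;
	}
	return (n * sxy - sx * sy)
	     / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}
END_C

my @a = (1, 2, 3, 4, 5);
my @b = (2, 4, 5, 4, 5);
printf "r = %.4f\n", pearson(\@a, \@b);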

PERL  Flush file output

If you want to write something to a file (e.g. write to a log file), PERL will handle a buffer and won’t write instantly to the file. In most cases, this behaviour is the best one (since writing to a file degrades the performance of your program). But it is not what you expect when you write to a logfile (as your program may stop, leaving the file empty). So, in this case, you should use binmode to give your filehandle a unix layer, which is unbuffered:

open (LOG, ">".$filename) || die "cannot create file\n";
binmode (LOG, ":unix");

Then, your PERL program will write instantly to your logfile…
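
As a side note, another common way to get the same behaviour is to enable autoflush on the filehandle:

use IO::Handle;

open (LOG, ">".$filename) || die "cannot create file\n";
LOG->autoflush(1);	# flush the buffer after every print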

PERL  Don’t dereference if you don’t need to…

Sometimes, PERL syntax can be… obscure, especially if you’re dealing with a Hash of pointers to Lists… Let’s imagine you’re handling the tree structure of a family, for instance. Then, you will have a Hash where the keys are the parents. In this Hash, you will store a pointer to a List, which stores the children. Thus, your data file has this structure:

parent1 child1
parent1 child2
child2 child3
...

Then, while reading the file content, you can use the following code:

my %Family = ();
while (my $line = <IN>) {
	chomp ($line);
	my @T = split(/\t/, $line);
	my $parent = $T[0];
	my $child = $T[1];
	push (@{$Family{$parent}}, $child);
}

Thus, to find all the children of one parent, you can use a recursive function:

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	foreach my $child (@{$$_Family{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}
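
Assuming the file has been parsed as above, a call to collect all the descendants of parent1 could look like this:

my %List = ();
get_children ('parent1', 'parent1', \%Family, \%List);
print join (", ", @{$List{'parent1'}}), "\n";	# child1, child2, child3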

But don’t write the following version (even if the syntax looks simpler):

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	my %Temp =  %$_Family;
	foreach my $child (@{$Temp{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}

In the last version, the line my %Temp = %$_Family dereferences the pointer into another Hash. If you’re dealing with a small pedigree, it won’t affect the performance of the script. But if you’re handling (very) large families, performance will drop, since at each recursive step the whole pedigree Hash is copied into another Hash (meaning memory transfers)…

Let’s benchmark the two different versions with a large pedigree of 33000 individuals (stored in %Family), reconstructing the list of all the children of a parent with an offspring of 600 individuals (the kind of family we have to manage in animal genetics):
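
The timing itself can be done with the Benchmark module; here is a sketch of the wrapper (timestr produces exactly the « Time taken » lines shown below):

use Benchmark;

my %List = ();
my $t0 = Benchmark->new;
get_children ('parent1', 'parent1', \%Family, \%List);
my $t1 = Benchmark->new;
print "Time taken was ", timestr(timediff($t1, $t0)), " seconds\n";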

Pointer version

Time taken was  0 wallclock secs ( 0.00 usr  0.00 sys +  0.00 cusr  0.00 csys =  0.00 CPU) seconds

Dereference version

Time taken was 27 wallclock secs (26.08 usr  0.09 sys +  0.00 cusr  0.00 csys = 26.17 CPU) seconds


Need more comments?

C/C++  C and Mac

Another programming tutorial, aimed at people who want to start learning the C language with MacOS X and Xcode. It will help you to set up your Xcode environment correctly and start with C programming. Again, no prior knowledge of programming is required.

The masters of the void

PERL  Basic stats with PERL

As mentioned in a previous post, I often need to compute simple and basic statistics on my datasets. In another previous post, I benchmarked several PERL modules for simple statistics computing (now, you know which modules to use in your scripts). Finally, to characterize my datasets, I also perform some basic statistical tests, such as the t test, for instance. Initially, I wrote values to files, loaded them in R and performed a t test using the t.test() function. But if you have to perform many tests, this method quickly becomes boring, so I undertook to search for a PERL module that does the same thing. One thing I really love about PERL and CPAN is that you will always find a module for your needs! And guess what? I’ve found the Statistics::TTest module…

And the second great thing is… that this module is really simple to use, just type:

use Statistics::TTest;
my @Data1 = ( array of values );
my @Data2 = ( array of values );
my $ttest = new Statistics::TTest;
$ttest->set_significance(95);
$ttest->load_data(\@Data1, \@Data2);
$ttest->output_t_test();

It will perform a t test between your two datasets (in @Data1 and @Data2) and you will get output like this on screen:

*****************************************************

Summary  from the observed values of the sample 1:
	sample size= 547 , degree of freedom=546
	mean=0.0475802079237217 , variance=0.353389091675633
	standard deviation=0.594465383075948 , standard error=0.0254175043571037
	 the estimate of the mean is 0.0475802079237217 +/- 0.0499276038086589
	 or (-0.00234739588493724 to 0.0975078117323805 ) with 95 % of confidence
	 t-statistic=T=1.87194648440864 , Prob >|T|=0.0617439999999999
*****************************************************

Summary  from the observed values of the sample 2:
	sample size= 17 , degree of freedom=16
	mean=0.309487901019442 , variance=0.370557765440336
	standard deviation=0.608734560740834 , standard error=0.147639817170496
	 the estimate of the mean is 0.309487901019442 +/- 0.312981648419734
	 or (-0.00349374740029174 to 0.622469549439176 ) with 95 % of confidence
	 t-statistic=T=2.09623600835297 , Prob >|T|=0.052318
*****************************************************

Comparison of these 2 independent samples.
	 F-statistic=1.04858291941977 , cutoff F-statistic=1.8274 with alpha level=0.05 and  df =(16,546)
	equal variance assumption is accepted(not rejected) since F-statistic < cutoff F-statistic
        degree of freedom=562 , t-statistic=T=1.78772255292605 Prob >|T|=0.07436
	the null hypothesis (the 2 samples have the same mean) is not rejected since the alpha level is 0.05
	difference of the mean=-0.26190769309572, standard error=0.146503545903722
	 the estimate of the difference of the mean is -0.26190769309572 +/- 0.287762264864091
	 or (-0.549669957959811 to 0.0258545717683704 ) with 95 % of confidence

If you are a bit confused by the amount of data printed on screen, you can also directly access specific values with some methods. For instance, if you wish to print only the t value, df and p-value, you can use the following methods (more methods are available in the CPAN man pages):

use Statistics::TTest;
my @Data1 = ( array of values );
my @Data2 = ( array of values );
my $ttest = new Statistics::TTest;
$ttest->set_significance(95);
$ttest->load_data(\@Data1, \@Data2);
my $t    = $ttest->t_statistic;
my $df   = $ttest->df;
my $prob = $ttest->{t_prob};
my $test = $ttest->null_hypothesis();
print "t=$t (df = $df) - p-value = $prob\nNull hypothesis is $test\n";

That will give you output like this:

t=1.78772255292605 (df = 562) - p-value = 0.07436
Null hypothesis is not rejected

Okay, computing t tests in PERL is nice, but someone could argue that the built-in function in R is quite good, too… And it’s not so difficult to export data to files… Yep, you’re right… But this module provides one (really) nice feature: it performs an F-test to check the equal variance assumption before computing the t test (you know, the var.equal flag set to FALSE by default in R). Then, the calculation method will differ according to the F test result…

So, finally, this module is fast enough, easy to implement and doesn’t require exporting data from your PERL scripts.

PERL  And the winner is…

In bioinformatics analyses, I often perform basic statistics computations (mean, median, etc.) to describe the data, or other calculations (correlation, for instance) to analyse sets of gene expression data… To conduct such analyses, I use some PERL modules to make these basic calculations. Although with small datasets the choice of a module has no « real » impact in terms of analysis time, it can be relevant to select the right module if the datasets become bigger or if the analysis has to be performed many times…

Depending on the analysis to carry out, I currently use 4 different PERL modules:

  • PDL (see older post for more details)
  • Statistics::Descriptive
  • Statistics::OLS
  • Statistics::Basic

But all these modules can handle basic calculations, so let’s benchmark!
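
The harness is based on the Benchmark module; here is a minimal sketch for the mean computation (not the exact script I used, and @values stands for the data loaded from the file):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use PDL;
use Statistics::Descriptive;
use Statistics::Basic qw(mean);

my @values = map { rand(1000) } 1 .. 100_000;	# placeholder for the real datafile

cmpthese(1000, {
	des => sub {
		my $stat = Statistics::Descriptive::Full->new();
		$stat->add_data(@values);
		my $m = $stat->mean();
	},
	bas => sub { my $m = mean(\@values); },
	pdl => sub { my $p = pdl(\@values); my $m = $p->avg(); },
});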

For benchmarking purposes, I will use a datafile with 100 000 genomic distances and will compute the mean and the root mean square (rms). For the correlation computations, I will use two expression profiling vectors of 139 assays. Here are the benchmark results:

Benchmarking MEAN and STD computation:
  - des : using Statistics::Descriptive module
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 1000 iterations of bas, des, pdl...
bas: 47 wallclock secs (45.31 usr +  0.60 sys = 45.91 CPU) @ 21.78/s (n=1000)
des: 176 wallclock secs (171.60 usr +  1.93 sys = 173.53 CPU) @  5.76/s (n=1000)
pdl: 42 wallclock secs (37.87 usr +  2.79 sys = 40.66 CPU) @ 24.59/s (n=1000)
      Rate  des  bas  pdl
des 5.76/s   -- -74% -77%
bas 21.8/s 278%   -- -11%
pdl 24.6/s 327%  13%   --

--
Benchmarking correlation computation:
  - ols : using Statistics::OLS
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 10000 iterations of bas, ols, pdl...
bas:  3 wallclock secs ( 2.98 usr +  0.01 sys =  2.99 CPU) @ 3344.48/s (n=10000)
ols:  7 wallclock secs ( 6.80 usr +  0.02 sys =  6.82 CPU) @ 1466.28/s (n=10000)
pdl:  2 wallclock secs ( 2.54 usr +  0.00 sys =  2.54 CPU) @ 3937.01/s (n=10000)
      Rate  ols  bas  pdl
ols 1466/s   -- -56% -63%
bas 3344/s 128%   -- -15%
pdl 3937/s 169%  18%   --

As you may notice, for basic computations, Statistics::Descriptive is the worst choice… If we benchmark the number of operations done in 5 seconds, we can see that our script using PDL or Statistics::Basic will perform about four times more operations than the one using Statistics::Descriptive!

Benchmark: running bas, des, pdl for at least 5 CPU seconds...
bas:  6 wallclock secs ( 5.21 usr +  0.08 sys =  5.29 CPU) @ 21.74/s (n=115)
des:  5 wallclock secs ( 5.12 usr +  0.06 sys =  5.18 CPU) @  5.79/s (n=30)
pdl:  5 wallclock secs ( 4.83 usr +  0.36 sys =  5.19 CPU) @ 24.66/s (n=128)


PERL  Control your jobs!

Let’s explain the situation: you have a lot of analyses to compute on your SMP computer (the same analysis on different files, many different analyses, or the same analysis many times, for simulation purposes, for instance). And, for different reasons, you cannot use or don’t have access to a cluster with many nodes… First, forget launching jobs one by one, and also forget launching all the jobs in one command (since your server may go down)!

You can perform such tasks by writing csh scripts, which will control your jobs and fill the job queue. Maybe you can also install some queue manager on your server. Looking around the web, I didn’t find a “simple” way to solve this problem, so I wrote a simple PERL script which manages a job queue. This script is based on the Proc::Simple module (available here). The full documentation of this script is available here. Basically, you can choose between 3 analysis modes: many different analyses, the same analysis on different input datafiles, and the same analysis many times. You also have to specify the number of simultaneous jobs to run (i.e. the number of available CPUs) and the script will manage your job queue. I’m not sure it’s the best way, but it works for me.
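
The core idea fits in a few lines with Proc::Simple; here is a minimal sketch of such a queue (not the actual script, just to give the idea):

#!/usr/bin/perl
use strict;
use warnings;
use Proc::Simple;

my $max_jobs = 4;	# i.e. the number of available CPUs
my @commands = map { "sleep 2; echo job $_" } 1 .. 10;	# the job list

my @slots = ();
while (@commands or grep { $_->poll() } @slots) {
	# keep only the processes that are still running
	@slots = grep { $_->poll() } @slots;
	# fill the free slots with pending commands
	while (@commands and scalar @slots < $max_jobs) {
		my $proc = Proc::Simple->new();
		$proc->start(shift @commands);	# launched through the shell
		push (@slots, $proc);
	}
	sleep 1;	# don't busy-wait
}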

I hope it will be useful; feel free to comment on this article.

PERL  Chromosome::Map module

A couple of days ago, I released a PERL module on CPAN which can generate PNG images of chromosomal maps. You can add several different tracks to the image (chromosomal features, markers, genes, QTL intervals, etc.). Contrary to existing modules, which can draw complex images with bioperl objects, this module is quite easy to use and to implement in your own scripts. Furthermore, this module was designed only to handle genetic and chromosomal maps, so you don’t have to read all the Bioperl documentation and a bunch of complex options!

This module is available through the CPAN installer or on the CPAN website. A more “pleasant” documentation, examples and the color codes used in the module can be found here. I hope this module will help you.

PERL  PDL in the real world

PDL is a great module for matrix and algebra computation. It is also quite fast and allows you to produce PERL software with a nice implementation of matrix computation. On CPAN, you can find a lot of tutorials about the use of the different functions and parameters of PDL. However, most of the examples on CPAN begin with piddle creations like these:

 $a = pdl [1..10];             # 1D array
 $a = pdl ([1..10]);           # 1D array
 $a = pdl (1,2,3,4);           # Ditto
 $b = pdl [[1,2,3],[4,5,6]];   # 2D 3x2 array
 $b = pdl 42;                  # 0-dimensional scalar
 $c = pdl $a;                  # Make a new copy
 $a = pdl([1,2,3],[4,5,6]);    # 2D
 $a = pdl([[1,2,3],[4,5,6]]);  # 2D

That’s okay for understanding the basics of PDL but, in the real world, we rarely work with such data! The first issue when you begin to use PDL is « how do I put real data into a piddle? ». The key is to work with array references. In this first example, you want to create a 1D array with the data from the list @T:

my $_t = [ @T ];
my $p = pdl ( [@$_t] );

If you want to create a 2D array, the code will be similar:

my $_t1 = [ @T1 ];
my $_t2 = [ @T2 ];
my $p = pdl ( [@$_t1] , [@$_t2] );

Alternatively, you can also use the following instructions:

my $p1 = pdl ( [@$_t1] );
my $p2 = pdl ( [@$_t2] );
my $matrix = cat $p1,$p2;

Now we can create piddles with real data, but these solutions still use « manual » piddle initialization. Most of the time, data are read from a file (or an SQL table), stored in a hash (or an array), and the dimension (or the number of rows) of an array is defined during the analysis… If you want to dynamically create a 2D array with several rows, one solution is to first create a 1D array and then glue the other rows onto it:

sub create_matrix ( $ )
{
	my ($g1) = @_;
	my $matrix = pdl ( [@$g1] );
	return $matrix;
}

sub update_matrix ( $ $ )
{
	my ($matrix,$g_new) = @_;
	my $p_new = pdl ( [@$g_new] );
	$matrix = $matrix->glue(1,$p_new);
	return $matrix;
}

Now we can use these functions. Let’s hypothesize that your data are stored in a file with this structure:

ID1	value1	value2	...	valueN
ID2	value1	value2	...	valueN
.
.
.
IDn	value1	value2	...	valueN

Your code could look like this:

open (IN, $file) || die "cannot open $file!\n";

my %Data = ();

while (my $line = <IN>)
{
	$line =~ s/\s+$//;
	my @T = split (/\t/,$line);
	my $id = shift (@T);
	$Data{$id} = [ @T ];
}

close (IN);

my $first_row = 1;
my $matrix = undef;

foreach my $id (keys %Data)
{
	my $_t = $Data{$id};
	if ($first_row)
	{
		$first_row = 0;
		$matrix = create_matrix($_t);
	}
	else
	{
		$matrix = update_matrix ($matrix,$_t);
	}
}

On CPAN, there are many other functions to manage piddles (xchg, for instance) but I hope this tutorial has given you a quick start with piddles.
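
For example, xchg exchanges two dimensions of a piddle (here, a simple transpose):

use PDL;

my $m = pdl ( [1,2,3] , [4,5,6] );	# 2D piddle, dims (3,2)
my $t = $m->xchg(0,1);			# exchange dims 0 and 1 => dims (2,3)
print $t;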