PERL  And the winner is…

In bioinformatics analysis, I often perform basic statistical computations (mean, median, etc.) to describe data, or other calculations (correlation, for instance) to analyse sets of gene expression data… To conduct such analyses, I use several PERL modules for these basic calculations. Although with small datasets the choice of module has no "real" impact on analysis time, selecting the right module becomes relevant when the datasets grow or when the analysis has to be performed many times…

Depending on the analysis to carry out, I currently use 4 different PERL modules:

  • PDL (see older post for more details)
  • Statistics::Descriptive
  • Statistics::OLS
  • Statistics::Basic

But all these modules can handle basic calculations, so let’s benchmark!
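The timings below come from the core Benchmark module. The harness can be sketched like this (a minimal sketch: plain-Perl mean/rms subs stand in for the module calls being compared, and the data are random stand-ins):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in data; in the real script this is read from the datafile
my @dist = map { int(rand 1_000_000) } 1 .. 10_000;

# Plain-Perl implementations; the real harness calls each module here
sub mean { my $s = 0; $s += $_ for @_; return $s / @_ }
sub rms  { my $s = 0; $s += $_ ** 2 for @_; return sqrt($s / @_) }

# Time 20 iterations of each labelled coderef and print a comparison table
cmpthese(20, {
    plain => sub { my ($m, $r) = (mean(@dist), rms(@dist)) },
});
```

Each module under test simply gets its own labelled coderef in the hash passed to cmpthese.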

For benchmarking purposes, I will use a datafile with 100 000 genomic distances and will compute the mean and the root mean square (rms). For correlation computations, I will use two expression-profiling vectors of 139 assays each. Here are the benchmark results:

Benchmarking MEAN and STD computation:
  - des : using Statistics::Descriptive module
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 1000 iterations of bas, des, pdl...
bas: 47 wallclock secs (45.31 usr +  0.60 sys = 45.91 CPU) @ 21.78/s (n=1000)
des: 176 wallclock secs (171.60 usr +  1.93 sys = 173.53 CPU) @  5.76/s (n=1000)
pdl: 42 wallclock secs (37.87 usr +  2.79 sys = 40.66 CPU) @ 24.59/s (n=1000)
      Rate  des  bas  pdl
des 5.76/s   -- -74% -77%
bas 21.8/s 278%   -- -11%
pdl 24.6/s 327%  13%   --

Benchmarking correlation computation:
  - ols : using Statistics::OLS
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 10000 iterations of bas, ols, pdl...
bas:  3 wallclock secs ( 2.98 usr +  0.01 sys =  2.99 CPU) @ 3344.48/s (n=10000)
ols:  7 wallclock secs ( 6.80 usr +  0.02 sys =  6.82 CPU) @ 1466.28/s (n=10000)
pdl:  2 wallclock secs ( 2.54 usr +  0.00 sys =  2.54 CPU) @ 3937.01/s (n=10000)
      Rate  ols  bas  pdl
ols 1466/s   -- -56% -63%
bas 3344/s 128%   -- -15%
pdl 3937/s 169%  18%   --

As you may notice, Statistics::Descriptive is the worst choice for basic computations… If we benchmark the number of operations done in 5 CPU seconds, we can see that a script using PDL or Statistics::Basic performs about four times more operations than one using Statistics::Descriptive!

Benchmark: running bas, des, pdl for at least 5 CPU seconds...
bas:  6 wallclock secs ( 5.21 usr +  0.08 sys =  5.29 CPU) @ 21.74/s (n=115)
des:  5 wallclock secs ( 5.12 usr +  0.06 sys =  5.18 CPU) @  5.79/s (n=30)
pdl:  5 wallclock secs ( 4.83 usr +  0.36 sys =  5.19 CPU) @ 24.66/s (n=128)
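For reference, the correlation being benchmarked is the standard Pearson coefficient; a plain-Perl version, with no module at all, looks like this (a sketch; the function name and sample vectors are made up):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson correlation of two equal-length array references
sub pearson {
    my ($x, $y) = @_;
    my $n  = @$x;
    my $mx = sum(@$x) / $n;
    my $my = sum(@$y) / $n;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mx, $y->[$i] - $my);
        $sxy += $dx * $dy;
        $sxx += $dx ** 2;
        $syy += $dy ** 2;
    }
    return $sxy / sqrt($sxx * $syy);
}

print pearson([1, 2, 3], [2, 4, 6]), "\n";   # 1 (perfectly correlated)
```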


PERL  Control your jobs!

Let’s describe the situation: you have a lot of analyses to run on your SMP computer (the same analysis on different files, many different analyses, or the same analysis many times, for simulation purposes, for instance). And, for various reasons, you cannot use or don’t have access to a multi-node cluster… First, forget about launching jobs one by one, and also forget about launching all the jobs with a single command (your server may go down)!

You could handle such tasks by writing csh scripts that control your jobs and fill the job queue, or perhaps by installing a queue manager on your server. Searching the web, I didn’t find a “simple” way to solve this problem, so I wrote a simple PERL script that manages a job queue. This script is based on the Proc::Simple module (available here). The full documentation of the script is available here. Basically, you can choose between 3 analysis modes: many different analyses, the same analysis on different input datafiles, or the same analysis many times. You also specify the number of simultaneous jobs to run (i.e. the number of available CPUs), and the script manages your job queue. I’m not sure it’s the best way, but it works for me.
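The core idea of such a queue manager can be sketched with plain fork/waitpid (a minimal sketch under stated assumptions — my script itself uses Proc::Simple, and the job names, slot count and sleep times here are made up):

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";    # for WNOHANG

my $max_jobs = 2;                        # number of CPUs to use (assumption)
my @queue    = map { "job$_" } 1 .. 6;   # placeholder job list
my %running;                             # pid => job name

while (@queue or %running) {
    # fill the free slots from the queue
    while (@queue and keys(%running) < $max_jobs) {
        my $job = shift @queue;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # child: the real script would exec the command here
            select(undef, undef, undef, 0.01);   # simulate some work
            exit 0;
        }
        $running{$pid} = $job;
    }
    # reap finished jobs without blocking
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $running{$pid};
    }
    select(undef, undef, undef, 0.02) if %running;   # avoid busy-waiting
}
print "all jobs finished\n";
```

The loop keeps exactly $max_jobs children alive at any time and refills a slot as soon as a job exits.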

I hope it will be useful; feel free to comment on this article.

PERL  Chromosome::Map module

A couple of days ago, I released a PERL module on CPAN that generates PNG images of chromosomal maps. You can add several different tracks to the image (chromosomal features, markers, genes, QTL intervals, etc.). Unlike existing modules that can draw complex images from Bioperl objects, this module is quite easy to use and to integrate into your own scripts. Furthermore, it was designed solely for genetic and chromosomal maps, so you don’t have to read the whole Bioperl documentation and a bunch of complex options!

This module is available through the CPAN installer or on the CPAN website. A more “pleasant” documentation, examples and the color codes used by the module can be found here. I hope this module will help you.

Unix  XML::SAX conflict between Debian & CPAN

If you have previously installed XML::SAX via CPAN and then try to install a Debian package that requires this module, you will probably get this error:

Fatal: cannot call XML::SAX->save_parsers_debian().
You probably have a locally installed XML::SAX module.
See /usr/share/doc/libxml-sax-perl/README.Debian.gz

For further details on this error, you can read this bug report page. The solution is to remove the CPAN-installed XML::SAX module, which can be done with this script:

#!/usr/bin/perl -w

use strict;
use ExtUtils::Packlist;
use ExtUtils::Installed;

$ARGV[0] or die "Usage: $0 Module::Name\n";
my $mod  = $ARGV[0];
my $inst = ExtUtils::Installed->new();

# delete every file installed by the module
foreach my $item (sort($inst->files($mod))) {
	print "removing $item\n";
	unlink $item;
}

# delete the module's packlist as well
my $packfile = $inst->packlist($mod)->packlist_file();
print "removing $packfile\n";
unlink $packfile;

Then, type the following commands (as root):

chmod u+x rm_perl_mod.pl
./rm_perl_mod.pl XML::SAX
apt-get upgrade

It works for me…

PERL  PDL in the real world

PDL is a great module for matrix and linear-algebra computation. It is also quite fast and lets you write PERL software with clean matrix-computation code. On CPAN, there are many tutorials about the various PDL functions and their parameters. However, most of the CPAN examples begin with piddle creations like these:

 $a = pdl [1..10];             # 1D array
 $a = pdl ([1..10]);           # 1D array
 $a = pdl (1,2,3,4);           # Ditto
 $b = pdl [[1,2,3],[4,5,6]];   # 2D 3x2 array
 $b = pdl 42;                  # 0-dimensional scalar
 $c = pdl $a;                  # Make a new copy
 $a = pdl([1,2,3],[4,5,6]);    # 2D
 $a = pdl([[1,2,3],[4,5,6]]);  # 2D

That’s fine for understanding the basics of PDL but, in the real world, we rarely work with such data! The first issue when you start using PDL is “how do I put real data into a piddle?”. The key is to work with array references. In this first example, you want to create a 1D array from the data in the list @T:

my $_t = [ @T ];
my $p = pdl ( [@$_t] );

If you want to create a 2D array, the code will be similar:

my $_t1 = [ @T1 ];
my $_t2 = [ @T2 ];
my $p = pdl ( [@$_t1] , [@$_t2] );

Alternatively, you can also use the following instructions:

my $p1 = pdl ( [@$_t1] );
my $p2 = pdl ( [@$_t2] );
my $matrix = cat $p1,$p2;

Now we can create piddles with real data, but these solutions still use “manual” piddle initialization. Most of the time, data are read from a file (or an SQL table), stored in a hash (or an array), and the dimensions (or the number of rows) of an array are only defined during the analysis… If you want to dynamically create a 2D array with several rows, one solution is to first create a 1D array and then to add the other rows to it.
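To make the grow-row-by-row idea concrete, here is the same pattern with a plain Perl array of array references, with no PDL involved (the names are illustrative):

```perl
use strict;
use warnings;

# A 2D matrix as an array of array references, grown one row at a time
my @matrix;

sub add_row {
    my ($row) = @_;
    push @matrix, [ @$row ];    # store a copy of the row
}

add_row([1, 2, 3]);
add_row([4, 5, 6]);

printf "%d rows x %d cols\n", scalar(@matrix), scalar(@{ $matrix[0] });
# prints "2 rows x 3 cols"
```

The PDL functions below do the equivalent on piddles.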

sub create_matrix {
	my ($g1) = @_;
	my $matrix = pdl ( [@$g1] );
	return $matrix;
}

sub update_matrix {
	my ($matrix, $g_new) = @_;
	my $p_new = pdl ( [@$g_new] );
	$matrix = $matrix->glue(1, $p_new);	# append as a new row
	return $matrix;
}

Now we can use these functions. Let’s assume your data are stored in a file with this structure:

ID1	value1	value2	...	valueN
ID2	value1	value2	...	valueN
IDn	value1	value2	...	valueN

Your code could look like this:

open (IN, $file) || die "cannot open $file!\n";

my %Data = ();

while (my $line = <IN>) {
	$line =~ s/\s+$//;
	my @T = split (/\t/, $line);
	my $id = shift (@T);
	$Data{$id} = [ @T ];
}

close (IN);

my $first_row = 1;
my $matrix;

foreach my $id (keys %Data) {
	my $_t = $Data{$id};
	if ($first_row) {
		$first_row = 0;
		$matrix = create_matrix($_t);
	}
	else {
		$matrix = update_matrix($matrix, $_t);
	}
}

On CPAN, there are many other functions for managing piddles (xchg, for instance), but I hope this tutorial has given you a quick start with piddles.

PERL  PERL modules

A reminder post with a quick script that displays the modules installed on your system along with their version numbers… Always useful…

#!/usr/bin/perl -w
use strict;
use ExtUtils::Installed;

my $instmod = ExtUtils::Installed->new();

foreach my $module ($instmod->modules()) {
	my $version = $instmod->version($module) || "unknown version";
	print "$module -- $version\n";
}

PERL  Smart sorting

Sorting mixed alphanumerical and numerical data in PERL with the plain sort function will not give you the results you expect. In this case, you have to use the cmp operator to sort the data… but then your numerical data will be sorted as 1, 10, 11…, which isn’t pretty. Instead, you can use this function:

sub smart_sort {
	my ($x, $y) = @_;
	# numerical sort when both values are numbers
	return ($x <=> $y) if ($x =~ /^\d+$/ && $y =~ /^\d+$/);
	# alphabetical sort otherwise
	return ($x cmp $y);
}

Then, in your code, call this function in this way:

foreach my $data (sort { smart_sort($a, $b) } keys %HASH)

Now, your mixed data will be sorted as 1, 2, … A, B, C…
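To see it in action, here is a self-contained run (the input data are made up):

```perl
use strict;
use warnings;

sub smart_sort {
    my ($x, $y) = @_;
    # numerical sort when both values are numbers
    return ($x <=> $y) if $x =~ /^\d+$/ && $y =~ /^\d+$/;
    # alphabetical sort otherwise
    return ($x cmp $y);
}

my @mixed  = qw(10 B 2 A 1 C);
my @sorted = sort { smart_sort($a, $b) } @mixed;
print "@sorted\n";   # 1 2 10 A B C
```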