février 2010

PERL  Basic stats with PERL

As mentioned in a previous post, I often need to compute simple and basic statistics of my datasets. In another previous post, I’ve performed some benchmarks with several PERL modules for  simple statistic computing (now, you know which modules to use in your script). Finally, to characterize my datasets, I also perform some basic statistical tests, as t test, for instance. Initially, I wrote values in files, loaded them in R and perform a t test using the t.test() function. But, if you have to perform many tests, with methods will be quickly boring, so I undertake to search for a PERL module that does the same thing. One thing I really love about PERL and CPAN is that you will always find a module for your needs! And guess what? I’ve found the Statistics::TTest module…

And second great thing is… that this module is really simple to use, juste type:

use Statistics::TTest;
my @Data1 = ( array of value);
my @Data2 = ( array of value);
my $ttest = new Statistics::TTest;

It will perform a t test between your two dataset (in @Data1 et @Data2) et you will get such output on screen:


Summary  from the observed values of the sample 1:
	sample size= 547 , degree of freedom=546
	mean=0.0475802079237217 , variance=0.353389091675633
	standard deviation=0.594465383075948 , standard error=0.0254175043571037
	 the estimate of the mean is 0.0475802079237217 +/- 0.0499276038086589
	 or (-0.00234739588493724 to 0.0975078117323805 ) with 95 % of confidence
	 t-statistic=T=1.87194648440864 , Prob >|T|=0.0617439999999999

Summary  from the observed values of the sample 2:
	sample size= 17 , degree of freedom=16
	mean=0.309487901019442 , variance=0.370557765440336
	standard deviation=0.608734560740834 , standard error=0.147639817170496
	 the estimate of the mean is 0.309487901019442 +/- 0.312981648419734
	 or (-0.00349374740029174 to 0.622469549439176 ) with 95 % of confidence
	 t-statistic=T=2.09623600835297 , Prob >|T|=0.052318

Comparison of these 2 independent samples.
	 F-statistic=1.04858291941977 , cutoff F-statistic=1.8274 with alpha level=0.05 and  df =(16,546)
	equal variance assumption is accepted(not rejected) since F-statistic < cutoff F-statistic
        degree of freedom=562 , t-statistic=T=1.78772255292605 Prob >|T|=0.07436
	the null hypothesis (the 2 samples have the same mean) is not rejected since the alpha level is 0.05
	difference of the mean=-0.26190769309572, standard error=0.146503545903722
	 the estimate of the difference of the mean is -0.26190769309572 +/- 0.287762264864091
	 or (-0.549669957959811 to 0.0258545717683704 ) with 95 % of confidence

If you are a bit confused by the amount of data printed on screen, you can also access directly to specific values with some methods. For instance, if you wish to print only the t value, df, p-value, you can use the following methods (more methods are available in the CPAN man pages):

use Statistics::TTest;
my @Data1 = ( array of value);
my @Data2 = ( array of value);
my $ttest = new Statistics::TTest;
my $t    = $ttest->t_statistic;
my $df   = $ttest->df;
my $prob = $ttest->{t_prob};
my $test = $ttest->null_hypothesis();
print "t=$t (df = $df) - p-value = $prob\nNull hypothesis is $test\n";

That will give you such output:

t=1.78772255292605 (df = 562) - p-value = 0.07436
Null hypothesis is not rejected

Okay, using t test computation in PERL is nice, but someone could argue me that built-in function in R is quite good, too… And, it’s not so difficult to export data to files… Yep, you’re right… But, this module provides one (really) nice feature: it makes a F-test to check for the equal variance assumption before computing the t test (you know, it’s the var.equal flag sets to FALSE by default in R). Then, the calculation method will be different according to the F test result…

So, finally, this module is fast enough, easy to implement and doesn’t require to export data from PERL scripts.

Blablabla  New theme

Don’t worry, you’re still on iChicken website! Just a simple theme switch. The initial one had been quickly choosen during the install, and, finally, some (minor) points of the old theme didn’t fit very well to the content of some posts. Moreover, I prefer a light and clear theme. Hope it will please you, too.

R  Histograms with multiple data series

To be honest, I have to admit that I don’t like the R software. But, I have to admit that this software could be helpful, too! I’m a totally beginner with this language and I don’t like it because I could find a nice and simple tutorial on it. I’m definitively lost with the R data formats. I can’t understand the differences between vector, list, factor, array, matrix or dataframe (and I didn’t find the good description of these fundamental concepts). Moreover, I didn’t like (euphemism) the way R is managing the data:

  • try to load the file with alphanumerical values and R will consider them as factor (even I don’t want my data as factor),
  • load your single column datafile with read.table function and ask R for a mean computation and histogram drawing: the mean calculation will be OK but for hist function, you have to transpose your data,
  • load again your single column datafile with read.table and try to compute a mean, R will return a ‘NaN’ value (why not, my data are maybe too close from zero or infinite). But, on the same object, invoke the summary function and you will get a result for the mean value (uuuh?)
  • again: single column datafile (I’m too beginner with R to try to handle more complex datafile ;-) !) loaded in R with read.table. This time mean calculation is OK but invoke the median function and you will get an error message telling you that your data are corrupted (???)… Let’s use your magic wand and transpose your data (again) and try again the median function. Wonderful, it works!!! What’s wrong with R? It can compute a mean value with my data but needs a transpose of the same data to compute a median? It seems a bit illogical…

I have dozen of examples like these! Okay, I’m a beginner with R and I’m definitively not good with this software! But, to my mind, the way of programming with R is the opposite of the good practice of coding (and it’s a PERL coder who says that!):

  • Don’t modify my data
  • Let me handle the format of my data
  • If data format is good for one task, it should be okay for a similar one

Anyway, I have to admit that I’m a bit dishonest with R and that I also need this software for some task easier with R than with PERL (remember one of the quality of PERL coder: laziness ;-) !). Initially, the purpose of this article was to talk about drawing histograms with multiple data series with R, so let’s do it!

I’ve found two ways for drawing such histograms with R: the first is explained here and for the second… I ask my PhD student if she knows how to do it with R (!)

For the multhist function, I’ve spent some times to put my data in the right format! I’ve always loved tutorials that explain functions with random data! Who analyses random data? Anyway, I finally succeed (great!):


data1 <- read.table("data1",header = FALSE)
data2 <- read.table("data2",header = FALSE)
data3 <- read.table("data3",header = FALSE)
data4 <- read.table("data4",header = FALSE)

data1 <- data1[,1]
data2 <- data2[,1]
data3 <- data3[,1]
data4 <- data4[,1]

l <- list (data1,data2,data3,data4)

and here’s the histogram!

But, despite the “additional arguments to hist or barplot” in the arguments section of the man page of the multhist function, I can neither change the breaks of the histgrams (breaks=brk which is a vector (?) of the break values), nor add a legend to the graph…
Read more »

PERL  And the winner is…

In bioinformatics analysis, I often perform basic statistic computations (mean, median, etc) to describe the data or others calculations (correlation, for instance) to analyse sets of gene expression data… To conduct such analyses, I use some PERL modules to make these basic calculations. Although with small datasets, the choice of a module has no « real » impact in term of time of analysis, it can be relevant to select the right module if the datasets become bigger or if the analysis has to be performed many times…

Depending on the analysis to carry out, I currently use 4 different PERL modules:

  • PDL (see older post for more details)
  • Statistics::Descriptive
  • Statistics::OLS
  • Statistics::Basic

But, all these modules can handle basic calculation, so let’s benchmark!

For benchmarking purpose, I will use a datafile with 100 000 genomic distances and will compute the mean and the root mean square (rms). For correlation computations, I will use two vectors of expression profiling of 139 assays. Here are the benchmark results:

Benchmarking MEAN and STD computation:
  - des : using Statistics::Descriptive module
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 1000 iterations of bas, des, pdl...
bas: 47 wallclock secs (45.31 usr +  0.60 sys = 45.91 CPU) @ 21.78/s (n=1000)
des: 176 wallclock secs (171.60 usr +  1.93 sys = 173.53 CPU) @  5.76/s (n=1000)
pdl: 42 wallclock secs (37.87 usr +  2.79 sys = 40.66 CPU) @ 24.59/s (n=1000)
      Rate  des  bas  pdl
des 5.76/s   -- -74% -77%
bas 21.8/s 278%   -- -11%
pdl 24.6/s 327%  13%   --

Benchmarking correlation computation:
  - ols : using Statistics::OLS
  - pdl : using PDL module
  - bas : using Statistics::Basic module

Benchmark: timing 10000 iterations of bas, ols, pdl...
bas:  3 wallclock secs ( 2.98 usr +  0.01 sys =  2.99 CPU) @ 3344.48/s (n=10000)
ols:  7 wallclock secs ( 6.80 usr +  0.02 sys =  6.82 CPU) @ 1466.28/s (n=10000)
pdl:  2 wallclock secs ( 2.54 usr +  0.00 sys =  2.54 CPU) @ 3937.01/s (n=10000)
      Rate  ols  bas  pdl
ols 1466/s   -- -56% -63%
bas 3344/s 128%   -- -15%
pdl 3937/s 169%  18%   --

As you may notice, for basic computations, Statistics::Descriptive is the worse choice… If we benchmark the number of operations done in 5 seconds, we can notice that our script using PDL or Statistics::Basic will perform four times more operations than the one using Statistics::Descriptive!

Benchmark: running bas, des, pdl for at least 5 CPU seconds...
bas:  6 wallclock secs ( 5.21 usr +  0.08 sys =  5.29 CPU) @ 21.74/s (n=115)
des:  5 wallclock secs ( 5.12 usr +  0.06 sys =  5.18 CPU) @  5.79/s (n=30)
pdl:  5 wallclock secs ( 4.83 usr +  0.36 sys =  5.19 CPU) @ 24.66/s (n=128)

Read more »

PERL  Control your jobs!

Let’s explain the situation: you have a lot of analyses to compute on your SMP computer (same analysis on different files, many different analyses or same analysis many times, for simulation purpose, for instance). And, for different reasons, you cannot use or you don’t have access to a cluster with many nodes… First, forget to launch job one by one and forget also to launch all the jobs in one command (since your server may go down)!

You can perform such tasks by writing csh scripts, which will control your jobs and fill the job queue. Maybe you can also install some queue manager on your server. By looking at the web, I didn’t find a “simple” way to solve this problem, so I wrote a simple PERL script, which manages a queue jobs list. This script is based on the Proc::Simple module (available here). The full documentation of this script is available here. Basically, you can choose between 3 analysis mode: many different analyses, same analysis on different input datafiles and same analysis many times. You also have to specify the number of simultaneous jobs to run (i.e. the number of available cpu) and the script will manage your queue job list. I’m not sure it’s the best way but it works for me.

I hope it will be useful and feel free to comment this article.