In bioinformatics analyses, I often compute basic statistics (mean, median, etc.) to describe data, or perform other calculations (correlation, for instance) to analyse sets of gene expression data. To do so, I rely on a few Perl modules. With small datasets, the choice of module has no real impact on analysis time, but picking the right one matters once the datasets grow or the analysis has to be run many times.

Depending on the analysis to carry out, I currently use four different Perl modules:

- PDL (see older post for more details)
- Statistics::Descriptive
- Statistics::OLS
- Statistics::Basic

However, all of these modules can handle basic calculations, so let’s benchmark!

For benchmarking purposes, I will use a data file with 100,000 genomic distances and compute the mean and the root mean square (rms, labelled STD in the output below). For the correlation computations, I will use two expression-profile vectors, each covering 139 assays. Here are the benchmark results:
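The script behind these numbers is not shown, so here is a minimal sketch of how such a comparison can be set up with Perl's core `Benchmark` module. Everything specific in it is an assumption made for illustration: the data are pseudo-random stand-ins for the distance file, and the two candidates (`mean_std_loop` and `mean_std_listutil`) are hypothetical pure-Perl and `List::Util` implementations rather than the modules compared in this post. In a real run, each labelled sub would wrap the calls of one statistics module.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);   # core module
use List::Util qw(sum);                 # core module

# Hypothetical stand-in for the data file: 100,000 pseudo-random values.
srand(42);
my @data = map { rand(1_000_000) } 1 .. 100_000;

# Candidate 1 (hypothetical): plain foreach loops.
sub mean_std_loop {
    my ($values) = @_;
    my $n   = @$values;
    my $sum = 0;
    $sum += $_ for @$values;
    my $mean = $sum / $n;
    my $ss   = 0;
    $ss += ($_ - $mean) ** 2 for @$values;
    return ($mean, sqrt($ss / $n));     # population form: divide by n
}

# Candidate 2 (hypothetical): same computation using List::Util::sum.
sub mean_std_listutil {
    my ($values) = @_;
    my $n    = @$values;
    my $mean = sum(@$values) / $n;
    my $ss   = sum(map { ($_ - $mean) ** 2 } @$values);
    return ($mean, sqrt($ss / $n));
}

# Run each labelled sub a fixed number of times and print the timings,
# then print the comparison chart (rates plus percentage matrix).
my $results = timethese(20, {
    loop  => sub { mean_std_loop(\@data) },
    lutil => sub { mean_std_listutil(\@data) },
});
cmpthese($results);
```

The iteration count is kept low here so the sketch finishes quickly; `Benchmark` may warn that the count is too small for reliable figures, which is why the runs in this post use 1,000 and 10,000 iterations.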

    Benchmarking MEAN and STD computation:
     - des : using Statistics::Descriptive module
     - pdl : using PDL module
     - bas : using Statistics::Basic module

    Benchmark: timing 1000 iterations of bas, des, pdl...
           bas:  47 wallclock secs ( 45.31 usr + 0.60 sys =  45.91 CPU) @ 21.78/s (n=1000)
           des: 176 wallclock secs (171.60 usr + 1.93 sys = 173.53 CPU) @  5.76/s (n=1000)
           pdl:  42 wallclock secs ( 37.87 usr + 2.79 sys =  40.66 CPU) @ 24.59/s (n=1000)

          Rate  des  bas  pdl
    des 5.76/s   -- -74% -77%
    bas 21.8/s 278%   -- -11%
    pdl 24.6/s 327%  13%   --
    --
    Benchmarking correlation computation:
     - ols : using Statistics::OLS
     - pdl : using PDL module
     - bas : using Statistics::Basic module

    Benchmark: timing 10000 iterations of bas, ols, pdl...
           bas: 3 wallclock secs ( 2.98 usr + 0.01 sys = 2.99 CPU) @ 3344.48/s (n=10000)
           ols: 7 wallclock secs ( 6.80 usr + 0.02 sys = 6.82 CPU) @ 1466.28/s (n=10000)
           pdl: 2 wallclock secs ( 2.54 usr + 0.00 sys = 2.54 CPU) @ 3937.01/s (n=10000)

          Rate  ols  bas  pdl
    ols 1466/s   -- -56% -63%
    bas 3344/s 128%   -- -15%
    pdl 3937/s 169%  18%   --
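To read the comparison charts: each cell gives how much faster (positive) or slower (negative) the row entry is than the column entry, as a percentage of the column's rate. The short sketch below recomputes the mean/STD chart from the rates reported above. The helper name `pct_diff` is made up for this illustration, and the formula is my reading of the chart rather than something stated in the output itself.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the mean/STD benchmark output.
my %rate = (des => 5.76, bas => 21.78, pdl => 24.59);

# Percentage by which $row is faster (+) or slower (-) than $col.
sub pct_diff {
    my ($row, $col) = @_;
    return 100 * ($rate{$row} / $rate{$col} - 1);
}

# Print every off-diagonal cell, slowest entry first, as in the chart.
my @names = sort { $rate{$a} <=> $rate{$b} } keys %rate;
for my $row (@names) {
    for my $col (@names) {
        next if $row eq $col;
        printf "%s vs %s: %+.0f%%\n", $row, $col, pct_diff($row, $col);
    }
}
```

For example, `bas` against `des` gives 100 × (21.78 / 5.76 − 1) ≈ 278%, which matches the chart.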

As you may notice, for basic computations, Statistics::Descriptive is the worst choice. If we instead count how many iterations complete in 5 CPU seconds, the script using PDL or Statistics::Basic performs about four times as many as the one using Statistics::Descriptive (128 and 115 iterations versus 30):

    Benchmark: running bas, des, pdl for at least 5 CPU seconds...
           bas: 6 wallclock secs ( 5.21 usr + 0.08 sys = 5.29 CPU) @ 21.74/s (n=115)
           des: 5 wallclock secs ( 5.12 usr + 0.06 sys = 5.18 CPU) @  5.79/s (n=30)
           pdl: 5 wallclock secs ( 4.83 usr + 0.36 sys = 5.19 CPU) @ 24.66/s (n=128)
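A time-bounded run like this one can be requested by passing a negative count to `timethese`, which `Benchmark` interprets as a minimum number of CPU seconds per entry. The sketch below is only an illustration under assumptions: the data and the two labelled subs (`loop`, `lutil`) are hypothetical stand-ins, not the modules timed above, and the limit is set to 1 second instead of 5 to keep it quick.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);   # core module
use List::Util qw(sum);        # core module

# Hypothetical stand-in data (not the genomic distances used in the post).
my @data = map { $_ % 1000 } 1 .. 100_000;

# The post used 5 CPU seconds; 1 keeps this sketch fast.
my $min_cpu_seconds = 1;

# A negative count means "run each entry for at least this many CPU seconds".
my $results = timethese(-$min_cpu_seconds, {
    loop  => sub { my $s = 0; $s += $_ for @data; $s / @data },
    lutil => sub { sum(@data) / @data },
});

# Each result is a Benchmark object; iters() is the number of completed runs
# (the n=... value in the printed report).
for my $name (sort keys %$results) {
    printf "%s: %d iterations\n", $name, $results->{$name}->iters;
}
```

With a fixed time budget, the iteration counts themselves become the measure: the more iterations an entry completes, the faster it is.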