In bioinformatics analyses, I often compute basic statistics (mean, median, etc.) to describe the data, or run other calculations (correlation, for instance) to analyse sets of gene expression data. I use a few Perl modules for these basic calculations. With small datasets, the choice of module has no « real » impact on analysis time, but picking the right one matters when the datasets grow or when the analysis has to be run many times.
Depending on the analysis to carry out, I currently use 4 different Perl modules:
- PDL (see older post for more details)
- Statistics::Descriptive
- Statistics::Basic
- Statistics::OLS
But all of these modules can handle basic calculations, so let’s benchmark!
For benchmarking purposes, I will use a data file with 100 000 genomic distances and compute the mean and the root mean square (rms). For the correlation computations, I will use two vectors of expression profiles from 139 assays. Here are the benchmark results:
    Benchmarking MEAN and STD computation:
    - des : using Statistics::Descriptive module
    - pdl : using PDL module
    - bas : using Statistics::Basic module
    Benchmark: timing 1000 iterations of bas, des, pdl...
    bas: 47 wallclock secs (45.31 usr + 0.60 sys = 45.91 CPU) @ 21.78/s (n=1000)
    des: 176 wallclock secs (171.60 usr + 1.93 sys = 173.53 CPU) @ 5.76/s (n=1000)
    pdl: 42 wallclock secs (37.87 usr + 2.79 sys = 40.66 CPU) @ 24.59/s (n=1000)
          Rate   des   bas   pdl
    des 5.76/s    --  -74%  -77%
    bas 21.8/s  278%    --  -11%
    pdl 24.6/s  327%   13%    --
    --
    Benchmarking correlation computation:
    - ols : using Statistics::OLS
    - pdl : using PDL module
    - bas : using Statistics::Basic module
    Benchmark: timing 10000 iterations of bas, ols, pdl...
    bas: 3 wallclock secs ( 2.98 usr + 0.01 sys = 2.99 CPU) @ 3344.48/s (n=10000)
    ols: 7 wallclock secs ( 6.80 usr + 0.02 sys = 6.82 CPU) @ 1466.28/s (n=10000)
    pdl: 2 wallclock secs ( 2.54 usr + 0.00 sys = 2.54 CPU) @ 3937.01/s (n=10000)
          Rate   ols   bas   pdl
    ols 1466/s    --  -56%  -63%
    bas 3344/s  128%    --  -15%
    pdl 3937/s  169%   18%    --
As you may notice, for basic computations, Statistics::Descriptive is the worst choice. If we instead benchmark the number of operations completed in 5 seconds, the script using PDL or Statistics::Basic performs roughly four times as many operations as the one using Statistics::Descriptive!
    Benchmark: running bas, des, pdl for at least 5 CPU seconds...
    bas: 6 wallclock secs ( 5.21 usr + 0.08 sys = 5.29 CPU) @ 21.74/s (n=115)
    des: 5 wallclock secs ( 5.12 usr + 0.06 sys = 5.18 CPU) @ 5.79/s (n=30)
    pdl: 5 wallclock secs ( 4.83 usr + 0.36 sys = 5.19 CPU) @ 24.66/s (n=128)
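The output format shown here is the one produced by Perl's core Benchmark module: `timethese` with a positive count runs each candidate a fixed number of times, a negative count runs each candidate for at least that many CPU seconds, and `cmpthese` prints the rate comparison table. Here is a minimal sketch of that setup, not the script used for this post. The data is synthetic, and the two candidates (`mean_loop`, `mean_lutil`) are hypothetical pure-Perl stand-ins so the example runs without any CPAN module; in a real comparison each sub would call one of the statistics modules instead.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use List::Util qw(sum);

# Synthetic stand-in for the 100 000 genomic distances (assumption: the
# real values are read from a data file).
my @data = map { ($_ * 7919) % 10_000 + 1 } 1 .. 100_000;

# Hypothetical candidate 1: mean computed with an explicit loop.
sub mean_loop {
    my ($values) = @_;
    my $total = 0;
    $total += $_ for @$values;
    return $total / @$values;
}

# Hypothetical candidate 2: mean computed with List::Util::sum.
sub mean_lutil {
    my ($values) = @_;
    return sum(@$values) / @$values;
}

# Each label maps to the code being timed; labels appear in the report.
my %candidates = (
    loop  => sub { mean_loop(\@data) },
    lutil => sub { mean_lutil(\@data) },
);

# Positive count: run every candidate exactly 200 times.
my $fixed = timethese(200, \%candidates);
cmpthese($fixed);

# Negative count: run every candidate for at least 1 CPU second.
my $timed = timethese(-1, \%candidates);
cmpthese($timed);
```

The time-based mode is convenient when the candidates differ a lot in speed, since no candidate ends up with too few iterations to give a reliable rate.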
For the correlation computation, we observe the same pattern: PDL is the fastest, closely followed by Statistics::Basic, and Statistics::OLS is the slowest (I didn’t benchmark Statistics::Descriptive since the covariance routine isn’t implemented). Again, if we benchmark the number of operations completed in 5 seconds, PDL and Statistics::Basic are more than twice as fast as Statistics::OLS.
    Benchmark: running bas, ols, pdl for at least 5 CPU seconds...
    bas: 6 wallclock secs ( 5.30 usr + 0.02 sys = 5.32 CPU) @ 3334.77/s (n=17741)
    ols: 5 wallclock secs ( 5.27 usr + 0.02 sys = 5.29 CPU) @ 1479.21/s (n=7825)
    pdl: 6 wallclock secs ( 5.27 usr + 0.03 sys = 5.30 CPU) @ 3863.21/s (n=20475)
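Before comparing speeds, it is worth checking that every module returns the same value. The sketch below is a pure-Perl reference for that check, assuming the quantity being benchmarked is the Pearson correlation coefficient (the post does not say which coefficient is used). The function name `pearson` and the two 139-point profiles are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson correlation of two equally sized vectors (array references).
sub pearson {
    my ($x, $y) = @_;
    die "vectors must have the same non-zero length\n"
        unless @$x && @$x == @$y;

    my $n      = @$x;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;

    # Accumulate the co-deviation and the two squared deviations.
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }

    die "correlation is undefined for a constant vector\n"
        if $sxx == 0 || $syy == 0;

    return $sxy / sqrt($sxx * $syy);
}

# Made-up expression profiles over 139 assays (not real data).
my @profile_a = map { sin($_ / 10) } 1 .. 139;
my @profile_b = map { (2 * sin($_ / 10) + 0.5 * cos($_)) } 1 .. 139;

printf "r = %.4f\n", pearson(\@profile_a, \@profile_b);
```

The value returned by each benchmarked sub can then be compared against this reference within a small tolerance before any timing is done.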
Finally, across all the benchmarks, PDL is the fastest, with Statistics::Basic close behind. Now, the choice is yours! In many cases, PDL is the fastest, but it can also be more difficult to implement and to use. So, if you’re looking for the best performance, I recommend using PDL (who said « try another language »?). If performance is not the most important thing to you, you should use Statistics::Basic: it may be easier to use and gives quite good results, too.
Here you can find the Perl script used to benchmark these modules.