juin 2010

PERL  Don’t dereference if you don’t need…

Sometimes, the PERL syntax could be… obscure, especially if you’re dealing with Hash of pointers to Lists… Let imagine you’re handling a tree structure of a family, for instance. Then, you will have a Hash where the keys are the parents. In this Hash, you will store a pointer to a List, which store the children. Thus, your data file has this structure:

parent1 child1
parent1 child2
child2 child3
...

Then, while reading the file content, you will use the following code:

my @T = split(/\t/, $line);
my $parent = $T[0];
my $child = $T[1];
push (@{$Family{$parent}}, $child);

Thus, to find all the children of one parent, you can use a recursive function:

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	foreach my $child (@{$$_Family{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}

But, don’t produce this code (even if the syntax could more simple):

sub get_children ( $ $ $ $ ) {
	my ($root_parent, $parent, $_Family, $_List) = @_;
	my %Temp =  %$_Family;
	foreach my $child (@{$Temp{$parent}}) {
		push (@{$$_List{$root_parent}}, $child);
		&get_children ($root_parent, $child, $_Family, $_List);
	}
}

In the last version, the line my %Temp = %$_Family dereference the pointer to another Hash. If you’re dealing with small pedigree, it won’t affect the performance of the script. But, if you’re handling (very) large families, the performance will drop down, since at each recursive step, the whole pedigree Hash is dereference to another Hash (meaning memory transferts)…

Let’s benchmark the two different codes with a large pedigree of 33000 individuals (store in %Family) and re-contruct the list of all the children for a parent with an offspring of 600 individuals (kind of families we have to manage in animal genetics):

Pointer version

Time taken was  0 wallclock secs ( 0.00 usr  0.00 sys +  0.00 cusr  0.00 csys =  0.00 CPU) seconds

Dereference version

Time taken was 27 wallclock secs (26.08 usr  0.09 sys +  0.00 cusr  0.00 csys = 26.17 CPU) seconds


Need more comments?