MarkerSet

MarkerSet v1.0 (2008)
Demeure O. & Lecerf F., 2008 – BMC Research Notes

1. DESCRIPTION

MarkerSet selects marker sets based on the positions of the markers on the genome and their informativity in the user's experimental designs. MarkerSet requires a functional Perl environment.

2. INPUT FILE

The data input file contains the marker information: the marker name, the chromosome carrying the marker, the chromosomal location (expressed in bases) and, for each experimental design, the number of animals heterozygous for the marker.

The structure of the file is:

markers chromosome  position  Exp.Design1   Exp.Design2 ..
ABC     1           23400     4             2
QSD     1           25809     1             4
...     ...         ...       ...           ...

The header line is required to describe the different experimental crosses.
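
As an illustration of this layout, the following Python sketch reads such a file into a list of records. It is not part of MarkerSet: the function name read_marker_file and the dictionary keys are hypothetical, and the sketch assumes that columns are separated by tabs or other whitespace.

```python
import io

def read_marker_file(handle):
    """Parse a MarkerSet-style input table from an open text handle.

    ASSUMPTION: columns are whitespace/tab separated, in the order
    marker name, chromosome, position (bases), then one count of
    heterozygous animals per experimental design.
    """
    header = handle.readline().split()
    if len(header) < 4:
        raise ValueError("a header line with at least one design column is required")
    design_names = header[3:]
    markers = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, chromosome, position = fields[0], fields[1], int(fields[2])
        counts = [int(value) for value in fields[3:]]
        if len(counts) != len(design_names):
            raise ValueError(f"marker {name}: expected {len(design_names)} design columns")
        markers.append({
            "name": name,
            "chromosome": chromosome,
            "position": position,
            "heterozygous": dict(zip(design_names, counts)),
        })
    return design_names, markers

# The two example markers from the table above, with a two-design header.
example = io.StringIO(
    "markers\tchromosome\tposition\tExp.Design1\tExp.Design2\n"
    "ABC\t1\t23400\t4\t2\n"
    "QSD\t1\t25809\t1\t4\n"
)
design_names, markers = read_marker_file(example)
print(design_names)
print(markers[0]["heterozygous"])
```

The design names are taken from the header, which is one reason a missing header line would make the remaining columns impossible to interpret.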

3. SYNTAX

MarkerSet.pl -i [file] -a [XX,XX] -d [XX] -m [XX] -c [XX] -s -r -v

software options:

-i  (REQUIRED) data input file: tabulated data text file.
Caution: header is REQUIRED for MarkerSet.

-a  (REQUIRED) numbers of animals tested in each design, separated by commas. These numbers give the theoretical maximum informativity values.

-d  [nb_required_marker] (REQUIRED) number of experimental design(s) tested.

-m  (REQUIRED) number of markers to select for each experimental design.

-c  [nb_required_marker] multidesign mode. When faced with several designs, users may want to compare two options: selecting markers perfectly fitted to each experimental design, or selecting a larger set of markers common to all designs (i.e. n markers specific to each of 4 different designs versus 4n markers in multidesign mode). The multidesign option pools the informativity of each design for each marker.

-s  simulation mode. This option tests different marker selection window sizes to find out which is best fitted to the data set (number of markers, informativity values, marker density, etc.). The window size interval is defined as a percentage interval of the AMI (from 15 to 40% by default) in the config.pm file (see section 4). WARNING: time consuming.

-r  resampling mode. When a marker selection window contains no informative marker (informativity equal to a threshold set in config.pm, see section 4) or no marker at all, this option searches for other markers by enlarging the selection window iteratively (by 1 cM by default, set in config.pm).
Caution: in this mode, $Dmax is disabled (see section 4) and distances between markers may increase considerably.

-v  verbose mode, intended for debugging: for every iteration and every position on the genome, the selected markers are printed (CAUTION: this mode is VERY verbose).
To record the verbose mode information, use the following command:
-v > filename
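
The pooling performed by the multidesign option (-c) can be pictured with the short Python sketch below. It is only an illustration: it assumes that pooling means summing the per-design values of a marker, which this document does not state explicitly, and the function name pooled_informativity is hypothetical.

```python
def pooled_informativity(per_design_counts):
    """Pool one marker's informativity over all designs.

    ASSUMPTION: pooling is a plain sum of the per-design counts of
    heterozygous animals; MarkerSet itself may combine them differently.
    """
    return sum(per_design_counts)

# Marker ABC from the input example: 4 heterozygous animals in the
# first design and 2 in the second, with 6 animals tested per design.
animals_per_design = [6, 6]
pooled = pooled_informativity([4, 2])
pooled_max = sum(animals_per_design)  # theoretical maximum under the same assumption
print(pooled, pooled_max)  # 6 12
```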

MarkerSet command line examples (1000 required markers for each design, 2 designs with 6 individuals):

1. basic analysis:

MarkerSet.pl -i filename -a 6 -d 1000

2. analysis with multidesign enabled:

MarkerSet.pl -i filename -a 6 -d 1000 -c 2000

3. basic analysis with resampling option enabled, for low density data files:

MarkerSet.pl -i filename -a 6 -d 1000 -r

4. analysis with multidesign, resampling and simulation enabled to evaluate the best parameters:

MarkerSet.pl -i filename -a 6 -d 1000 -r -c 2000 -s

Note: invoking the command ./MarkerSet.pl displays the different options.

4. CONFIG FILE

Core program parameters are defined in the config.pm file, which should be placed in the same directory as the program.

Users may want to customize MarkerSet. This is possible through config.pm, which contains most core program variables. Be advised that wrong variable definitions could alter the functioning of the program.

This config file is written as a Perl module, so some parts are absolutely necessary for program operation. Unless you are familiar with Perl, it is highly recommended to edit only the variables declaration, computing parameters and simulation parameters sections.

Variables declaration:
  • correspondence between bases and cM
    our $cM_Mb=300000;

    This value gives the correspondence between bases and centimorgans (expected number of bases for 1 cM).

  • maximum distance between two windows
    our $Dmax=20;

    Gives the maximum distance between two marker selection windows. Note that, because a window can span beyond its starting point, the actual maximum distance may be slightly higher. With the resampling option, this limit is disabled.

  • Window size variables. There are two ways to set the marker selection window:
    with an absolute value in bases:

    our $span=150000;

    By default, this variable is set to undef, which means that MarkerSet will compute the window size as a ratio of the AMI (Average Marker Interval).

    with a ratio value:

    our $span_default=0.XX;

    By default, $span_default is set to 0.20 (20% of the AMI).
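
How these variables combine can be sketched in Python as follows. This is an illustration rather than MarkerSet's own code: the function names are hypothetical, the constants merely mirror the config.pm variables, and the AMI is taken as an input in bases because its computation is not detailed here.

```python
CM_MB = 300000        # mirrors $cM_Mb: expected number of bases for 1 cM
SPAN = None           # mirrors $span: absolute window size in bases, or undefined
SPAN_DEFAULT = 0.20   # mirrors $span_default: window size as a fraction of the AMI

def bases_to_cm(bases, bases_per_cm=CM_MB):
    """Convert a physical distance in bases to centimorgans."""
    return bases / bases_per_cm

def cm_to_bases(cm, bases_per_cm=CM_MB):
    """Convert a genetic distance in centimorgans to bases."""
    return cm * bases_per_cm

def window_size(ami_bases, span=SPAN, span_default=SPAN_DEFAULT):
    """Return the marker selection window size in bases.

    An absolute span takes precedence when it is defined; otherwise the
    window is the fraction span_default of the Average Marker Interval.
    """
    if span is not None:
        return span
    return span_default * ami_bases

print(bases_to_cm(150000))                   # 0.5
print(cm_to_bases(1))                        # 300000
print(window_size(1_000_000, span=150000))   # 150000
print(window_size(1_000_000))                # 20% of a hypothetical 1 Mb AMI
```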

To avoid erratic operation, minimum and maximum window sizes are computed by MarkerSet. If the window size is lower than the minimum limit or higher than the maximum limit, the program aborts. Users may want to modify these values (for testing purposes):

  • min limit (half-size set by default to genome size / number of available markers): you can set the min value in bases.
    Be advised that the requested number of markers may not be reached, as the marker selection window size will be lower than the average genome marker coverage!
  • max limit (half-size set by default to 1 cM in bases): you can change this value, but distances between markers will fluctuate if the max limit is too high.
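
The default limits can be sketched in Python as below. This is a hedged illustration, not the program's code: it assumes that both limits apply to half of the window size, the function names are hypothetical, and raising an exception stands in for the program aborting.

```python
def window_limits(genome_size, n_available_markers, bases_per_cm=300000):
    """Return the default (min, max) limits of the window half-size, in bases."""
    min_half = genome_size / n_available_markers  # average genome marker coverage
    max_half = 1 * bases_per_cm                   # 1 cM expressed in bases
    return min_half, max_half

def check_window(window_size, genome_size, n_available_markers, bases_per_cm=300000):
    """Return the window half-size, or raise ValueError when it is out of range.

    ASSUMPTION: the limits are compared with half of the window size.
    """
    half = window_size / 2
    min_half, max_half = window_limits(genome_size, n_available_markers, bases_per_cm)
    if half < min_half or half > max_half:
        raise ValueError(
            f"window half-size {half} outside [{min_half}, {max_half}] bases"
        )
    return half

# Hypothetical genome of 1,000,000,000 bases covered by 20,000 markers.
print(window_limits(1_000_000_000, 20_000))          # (50000.0, 300000)
print(check_window(150_000, 1_000_000_000, 20_000))  # 75000.0
```

With the same hypothetical genome, a 50,000-base window has a half-size of 25,000 bases, below the 50,000-base minimum, so check_window would raise an error.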

Computing parameters:
  • threshold factor and increase window size for low informativity cases
    our $threshold_factor=2;
    our $space_percent=0.50;

    Basic mode: sets the threshold ($threshold_factor) under which MarkerSet
    will try to find more informative markers (e.g., for 10 tested animals and
    $threshold_factor=2, MarkerSet will search for more informative markers if
    the currently selected one has an informativity strictly lower than 5) and
    how much the marker selection window will be enlarged on each side
    ($space_percent=0.50, by default 50% of the window size). Warning: as the
    number of animals can vary between designs in multidesign mode,
    $threshold_factor is NOT the minimal accepted informativity, but a
    divisor of the maximal informativity available.

  • threshold and increase window size for resampling
    our $space_resample=1*$cM_Mb;
    our $resample_threshold=0;

    Resampling option only: defines the enlargement of the marker selection
    window (by default 1 cM in bases, $space_resample=1*$cM_Mb) and the
    informativity threshold under which the resampling mode is activated
    ($resample_threshold=0). Be advised that setting the resample threshold too
    high will generate erratic distances between markers (as the $Dmax limit is
    disabled in this mode).

  • min and max limits of AMI’s percentage for simulation
    our $sim1=15;
    our $sim2=40;

    Simulation option only: range of AMI percentages used for the marker
    selection window.
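
The basic-mode threshold, the window enlargement and the simulation range described above can be sketched in Python as follows. The function names are hypothetical and the constants only mirror the config.pm defaults. The 5-point step between tested percentages is an assumption made for the example, since the step used by MarkerSet is not given here.

```python
THRESHOLD_FACTOR = 2    # mirrors $threshold_factor
SPACE_PERCENT = 0.50    # mirrors $space_percent
SIM1, SIM2 = 15, 40     # mirror $sim1 and $sim2 (percent of the AMI)

def needs_better_marker(informativity, max_informativity, factor=THRESHOLD_FACTOR):
    """True when the selected marker is strictly below max_informativity / factor."""
    return informativity < max_informativity / factor

def enlarged_window(start, end, space_percent=SPACE_PERCENT):
    """Extend the window [start, end] by space_percent of its size on each side."""
    extra = (end - start) * space_percent
    return start - extra, end + extra

def simulation_window_sizes(ami_bases, low=SIM1, high=SIM2, step=5):
    """Candidate window sizes in bases, from low% to high% of the AMI.

    ASSUMPTION: the step between percentages is illustrative only.
    """
    return [ami_bases * pct // 100 for pct in range(low, high + 1, step)]

# 10 tested animals and a factor of 2 give a threshold of 5.
print(needs_better_marker(4, 10))        # True
print(needs_better_marker(5, 10))        # False
print(enlarged_window(100_000, 200_000)) # (50000.0, 250000.0)
print(simulation_window_sizes(1_000_000))
```

Passing the maximal informativity explicitly reflects the warning above: in multidesign mode the threshold follows the maximal informativity available rather than a fixed minimum.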

5. OUTPUT FILES

The output is a log file (raw text file) with four groups of information. The first is a description of the analysis (command line, files used, requested options), of the experimental design(s) (number of animals tested for each design) and of the genome under study (chromosome number and total physical size). The second gives the informativity weight costs for each experimental design and class of informativity.
The third describes the number of markers available in each class of informativity for each experimental design. Finally, the fourth lists the marker panels created, ranked by decreasing informativity score.

In addition, each panel (differing by the first window starting point, or by the window size if the simulation mode is activated) has an associated file (tabulated text file). This file contains the informativity score, the number of selected markers compared to the original number of markers for each informativity class and, finally, the list of the selected markers with, for each one, the corresponding chromosome, the chromosomal location, the informativity and the interval distance between that marker and the next one (the markers being sorted by their genome locations).
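
The interval distance listed in the panel file can be pictured with the Python sketch below. It is an illustration under stated assumptions, not the program's output routine: the function name marker_intervals is hypothetical, and reporting no distance for the last marker of a chromosome is a choice made for the example.

```python
def marker_intervals(markers):
    """Return (name, chromosome, position, distance_to_next) tuples.

    markers: iterable of (name, chromosome, position) tuples.
    Markers are sorted by chromosome label, then position. The
    chromosome label is compared as given, so string labels sort
    lexicographically.
    ASSUMPTION: the last marker of a chromosome gets None as distance.
    """
    ordered = sorted(markers, key=lambda m: (m[1], m[2]))
    rows = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        if following is not None and following[1] == current[1]:
            distance = following[2] - current[2]
        else:
            distance = None
        rows.append((*current, distance))
    return rows

# The two example markers from section 2, given out of order.
panel = [("QSD", "1", 25809), ("ABC", "1", 23400)]
for row in marker_intervals(panel):
    print(row)
# ('ABC', '1', 23400, 2409)
# ('QSD', '1', 25809, None)
```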