décembre 2012

Unix  Code more shell scripts (Part. II)

With NGS data, we are experiencing to handle very (very very) large datasets. Whatever the coding language used (PERL, python, C…), the open/read part of a program is (always) very slow. Thus, the debug process could really be a pain due to the time to read the file. Using system unix tools is one of the approach to cut /merge / debug / analyze the content of the files…

Wanna split?

One column contains data with this format XXXX_YYYY and you want to have XXXX and YYYY and two different columns? Easy, use (again) the tr command:

tr '_' '\t' < file > new_file

Faster than a PERL program!

What is my header content?

Somebody has sent you a large file with many (many) columns and you want to know what is the header content (to extract the goog columns with the cut command, for instance)? Whatever the separator used (tab, comma, etc), reading and counting columns on screen is a pain… First, extract the first line with:

head -1 file > header_content

Then, transform your column separator in a carriage-return character (here for a TSV file):

tr '\t' '\n' < header_content > list_header

Finally, to display the header, type:

cat -n list_header

(the -n option will display the line number, easier for the cut command)

How many [XXXX] in my file?

Let’s end with a short introduction about the grep command (I will prepare a more consequent post only for grep next time). The ‘grep’ command is powerful, no doubt about that, but not so easy the first time! Basically, the grep syntax is:

grep [options] my_pattern file

One usefull option is the -w parameter (means ‘word’). Thus, if you only want to ‘grep’ a specific pattern (i.e. chr and not chromosome), you should use this parameter.

software  DGD accepted for publication in PLOS ONE

The paper describing the construction and the content of this database had been accepted for publication and the article is now available on the PLOS ONE website. We improved the first release by adding data from GEO and GO database. These addition permitted to analyse the gene coexpression level and the annotation similarity inside each group of duplicated genes (see below).

 

The database is available here : http://dgd.genouest.org. A contact form is available on the official AnnotQTL website, you can leave a comment or ask for new species.