With NGS data, we are experiencing to handle very (very very) large datasets. Whatever the coding language used (PERL, python, C…), the open/read part of a program is (always) very slow. Thus, the debug process could really be a pain due to the time to read the file. Using system unix tools is one of the approach to cut /merge / debug / analyze the content of the files…
One column contains data with this format XXXX_YYYY and you want to have XXXX and YYYY and two different columns? Easy, use (again) the tr command:
tr '_' '\t' < file > new_file
Faster than a PERL program!
What is my header content?
Somebody has sent you a large file with many (many) columns and you want to know what is the header content (to extract the goog columns with the cut command, for instance)? Whatever the separator used (tab, comma, etc), reading and counting columns on screen is a pain… First, extract the first line with:
head -1 file > header_content
Then, transform your column separator in a carriage-return character (here for a TSV file):
tr '\t' '\n' < header_content > list_header
Finally, to display the header, type:
cat -n list_header
(the -n option will display the line number, easier for the cut command)
How many [XXXX] in my file?
Let’s end with a short introduction about the grep command (I will prepare a more consequent post only for grep next time). The ‘grep’ command is powerful, no doubt about that, but not so easy the first time! Basically, the grep syntax is:
grep [options] my_pattern file
One usefull option is the -w parameter (means ‘word’). Thus, if you only want to ‘grep’ a specific pattern (i.e. chr and not chromosome), you should use this parameter.