G3C

This document describes how to use the G3C package: how to create the required databases and the order in which to run the scripts.

The Perl scripts require several modules (go-db-perl in particular). All of these modules are available via CPAN. Note that most of the scripts connect to a SQL database: you have to fill in the $user field (and $password, if needed) in each script.

SUMMARY

  1. Perl modules
  2. Before running G3C
    1. G3C SQL BASE structure
    2. Get Group of similar GO terms file
  3. Running G3C and retrieved files structures
  4. Results files analyses
    1. G3C_Analyse
    2. G3C_keyword
    3. G3C_enrich_cluster_QTL
    4. G3C_coregul
    5. G3C_generate_stat and G3C_generate_stat_IC
  5. Annexes: SQL dump of the G3C database structure

1. Perl modules

Install the following Perl modules in order to use all of the G3C scripts:

DBI — 1.609
BioPerl — 1.6.0
go-perl — 0.09
go-db-perl — 0.01
Digest::SHA1 — 2.11
Memoize — 1.01
PDL — 2.4.4
Statistics::Descriptive — 2.6
Statistics::Distributions — 1.02
Statistics::OLS — 0.07
Switch — 2.14
Getopt::Long — 2.38

2. Before running G3C

Create and fill the G3C SQL database, then generate the groups of similar GO terms (Group_SIM).

2.1 G3C SQL BASE structure

Create a SQL database called ‘G3C’ and two tables named ‘goa’ and ‘GoLoc’ with the structures below. These two tables have to be filled before running ‘G3C.pl’.

‘goa’ table

This table stores all the GO IDs for each gene product of each species, so that the GO terms of a given gene can be retrieved.

'goa' table structure
+-------------+--------------+------+---------+
| Field       | Type         | Null | Default |
+-------------+--------------+------+---------+
| db          | varchar(40)  | Yes  | NULL    |
| db_id       | varchar(60)  | Yes  | NULL    |
| db_symbol   | varchar(60)  | Yes  | NULL    |
| go_id       | varchar(20)  | Yes  | NULL    |
| go_ref      | varchar(50)  | Yes  | NULL    |
| evidence    | varchar(10)  | Yes  | NULL    |
| go_class    | varchar(3)   | Yes  | NULL    |
| name        | varchar(200) | Yes  | NULL    |
| goa_species | varchar(60)  | Yes  | NULL    |
| IPI_id      | varchar(20)  | Yes  | NULL    |
+-------------+--------------+------+---------+

Download the gene_association.goa_species.gz file for each analyzed species from the EBI FTP site (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/).
Then unzip the files in a directory and run the Perl script:

g3c_sql_goa.pl -f file1 -f file2...

The ‘goa’ table must be filled before the ‘GoLoc’ table.
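
To make the mapping between a gene association file and the ‘goa’ table more concrete, here is a minimal sketch in Python (used here only for illustration; it is not the code of g3c_sql_goa.pl). It assumes the standard tab-separated GAF column order of the GOA files. How the script actually fills goa_species and IPI_id is not documented here, so both are handled with labelled assumptions, and the sample line is made up.

```python
# Illustrative sketch only: not the implementation of g3c_sql_goa.pl.
# 'goa' column name -> 0-based index in a tab-separated GAF line.
GAF_COLUMNS = {
    "db": 0,         # DB
    "db_id": 1,      # DB Object ID
    "db_symbol": 2,  # DB Object Symbol
    "go_id": 4,      # GO ID
    "go_ref": 5,     # DB:Reference
    "evidence": 6,   # Evidence code
    "go_class": 8,   # Aspect (P, F or C)
    "name": 9,       # DB Object Name
}


def parse_gaf_line(line, species):
    """Return a dict keyed by 'goa' column names, or None for comment/blank lines."""
    if line.startswith("!") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    row = {column: fields[index] for column, index in GAF_COLUMNS.items()}
    row["goa_species"] = species  # assumption: supplied by the caller (e.g. from the file name)
    row["IPI_id"] = None          # assumption: left empty, its source is not documented
    return row


# Made-up annotation line with the 15 GAF columns.
example = "\t".join([
    "UniProtKB", "X00001", "GENE1", "", "GO:0000001", "REF:0000001", "IEA",
    "", "P", "example protein", "", "protein", "taxon:9913", "20090101", "UniProtKB",
])
row = parse_gaf_line(example, "cow")
print(row["db_id"], row["go_id"], row["go_class"])
```

Each dictionary returned this way corresponds to one row to insert into ‘goa’.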

‘GoLoc’ table

This table stores the location information and the go_id for each gene.

'GoLoc' table structure
+----------+-------------+------+---------+
| Field    | Type        | Null | Default |
+----------+-------------+------+---------+
| species  | varchar(60) | Yes  | NULL    |
| go_class | varchar(10) | Yes  | NULL    |
| go_id    | varchar(20) | Yes  | NULL    |
| symbol   | varchar(60) | Yes  | NULL    |
| IPI_id   | varchar(20) | Yes  | NULL    |
| chr      | varchar(10) | Yes  | NULL    |
| start    | varchar(20) | Yes  | NULL    |
| end      | varchar(20) | Yes  | NULL    |
| strand   | varchar(10) | Yes  | NULL    |
+----------+-------------+------+---------+

Download the ipi.genes.SPECIE.xrefs.gz file for each required species from the EBI FTP site (ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/) and unzip the files in a directory. Rename the files so that the species name is written exactly as in the goa_species field of the goa table.
Then run g3c_sql_GoLoc.pl in this directory with this command:

g3c_sql_GoLoc.pl -f file1 -f file2...

In order to use G3C_Analyse and the other G3C scripts, you should create two more tables called ‘ensembl’ and ‘go’.

‘ensembl’ table

This table contains Ensembl information for each gene and is required to use G3C_Analyse.pl.

+-------------+-------------+------+---------+
| Field       | Type        | Null | Default |
+-------------+-------------+------+---------+
| ID          | varchar(40) | Yes  | NULL    |
| name        | varchar(50) | Yes  | NULL    |
| description | longtext    | Yes  | NULL    |
| strand      | varchar(10) | Yes  | NULL    |
| chr         | varchar(40) | Yes  | NULL    |
| start       | varchar(60) | Yes  | NULL    |
| end         | varchar(60) | Yes  | NULL    |
+-------------+-------------+------+---------+

Download the required BioMart file (at http://www.biomart.org), with this header structure:

EnsemblGeneID     Description     Chromosome   Start    End   Strand  GeneName

(fields are separated by tabs)
Then run g3c_sql_ensembl.pl in the BioMart directory with this command:

g3c_sql_ensembl.pl -f file
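
The column order of the BioMart export differs from the column order of the ‘ensembl’ table, so the loading step has to reorder the fields. The sketch below illustrates that reordering in Python. It is not the code of g3c_sql_ensembl.pl: an in-memory SQLite database stands in for the real G3C database, and the sample row is made up.

```python
import csv
import io
import sqlite3

# (BioMart header name, 'ensembl' table column), listed in table column order.
HEADER_TO_COLUMN = [
    ("EnsemblGeneID", "ID"),
    ("GeneName", "name"),
    ("Description", "description"),
    ("Strand", "strand"),
    ("Chromosome", "chr"),
    ("Start", "start"),
    ("End", "end"),
]


def load_biomart(handle, conn):
    """Insert every row of a tab-separated BioMart export into 'ensembl'."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = ", ".join('"%s"' % column for _, column in HEADER_TO_COLUMN)
    marks = ", ".join("?" for _ in HEADER_TO_COLUMN)
    rows = [tuple(record[header] for header, _ in HEADER_TO_COLUMN) for record in reader]
    conn.executemany("INSERT INTO ensembl (%s) VALUES (%s)" % (columns, marks), rows)
    return len(rows)


# Assumption: SQLite in memory replaces the real SQL server for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE ensembl ("ID" TEXT, "name" TEXT, "description" TEXT, '
    '"strand" TEXT, "chr" TEXT, "start" TEXT, "end" TEXT)'
)

# Made-up export using the header structure described above.
sample = io.StringIO(
    "EnsemblGeneID\tDescription\tChromosome\tStart\tEnd\tStrand\tGeneName\n"
    "GENE_ID_1\texample gene\t1\t100\t200\t1\tGENE1\n"
)
inserted = load_biomart(sample, conn)
print(inserted, conn.execute('SELECT "ID", "name", "chr" FROM ensembl').fetchone())
```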

‘go’ table

This table contains basic information on the GO terms (ID, class and description) and is required to use G3C_Analyse.pl.

+-------------+-------------+------+---------+
| Field       | Type        | Null | Default |
+-------------+-------------+------+---------+
| goid        | varchar(40) | Yes  | NULL    |
| goclass     | varchar(3)  | Yes  | NULL    |
| description | text        | Yes  | NULL    |
+-------------+-------------+------+---------+

The GoLoc table must be filled before filling this table.
Run g3c_sql_go.pl.

2.2 Get Group of similar GO terms file

Create a list of unique GO terms for each class from the goa table, then run SimGIC on each list. Put the SimGIC result files in a directory, named like this:

SSM_GOclass_Cellular_List.txt for Cellular Component class.
SSM_GOclass_Biological_List.txt for Biological Process class.
SSM_GOclass_Molecular_List.txt for Molecular Function class.

Run filter_sim.pl on these files to discard the GO terms whose similarity value is below a threshold:

filter_sim.pl -s file1 -s file2 -s file3 -t x.xx

where x.xx is the threshold value.
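
The thresholding idea can be sketched as follows in Python. This is only an illustration, not the code of filter_sim.pl. The input layout (two GO IDs and a similarity value per line, separated by whitespace) is an assumption, as is the choice to keep pairs whose value is exactly equal to the threshold. The GO IDs and values are made up.

```python
def filter_pairs(lines, threshold):
    """Keep the (term1, term2, similarity) triples whose similarity reaches the threshold."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        term_a, term_b, similarity = parts[0], parts[1], float(parts[2])
        if similarity >= threshold:
            kept.append((term_a, term_b, similarity))
    return kept


# Made-up similarity values.
sample_lines = [
    "GO:0000001 GO:0000002 0.82",
    "GO:0000001 GO:0000003 0.12",
    "GO:0000002 GO:0000003 0.50",
]
kept_pairs = filter_pairs(sample_lines, 0.5)
print(kept_pairs)
```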

Then create the groups of similar GO terms (Group_SIM) by running g3c_group_sim.pl, like this:

g3c_group_sim.pl -s file

where file is the group_sim_file_list generated by filter_sim.pl, named like this:

FILTER_x.xx_SSM_GOclass_list
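
One way to picture a group of similar GO terms is as one "accessor" term together with every term that remains similar to it after filtering (a 1-n relation). The Python sketch below builds such groups from filtered pairs. It is an assumption about the grouping rule, not the documented algorithm of g3c_group_sim.pl, and the input pairs are made up.

```python
from collections import defaultdict


def build_groups(pairs):
    """Map each GO term to the sorted list of terms it is similar to."""
    groups = defaultdict(set)
    for term_a, term_b, _similarity in pairs:
        # Similarity is treated as symmetric (assumption).
        groups[term_a].add(term_b)
        groups[term_b].add(term_a)
    return {accessor: sorted(members) for accessor, members in sorted(groups.items())}


# Made-up pairs that passed the similarity threshold.
filtered = [
    ("GO:0000001", "GO:0000002", 0.82),
    ("GO:0000002", "GO:0000003", 0.50),
]
group_sim = build_groups(filtered)
for accessor, members in group_sim.items():
    print(accessor, members)
```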

You can get information on the groups of similar GO terms by running g3c_group_sim_analyse.pl.

3. Running G3C and retrieved files structures

You can then get the co-annotated and co-located clusters of genes by running G3C.pl with this command:

G3C.pl -s LIST_GROUPSIM -g LIST_Duplicate [-u / -d]

The LIST_GROUPSIM file should contain the following data: the GO class name and the GROUPSIM filename for a specific similarity value, as shown here:

P    GROUPSIM_0.5_SSM.txt
F    GROUPSIM_0.3_SSM.txt
C    GROUPSIM_0.6_SSM.txt
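
As an illustration of this layout, the Python sketch below reads such a list into a mapping from GO class to filename. It is not the parser used by G3C.pl. It assumes that the two columns are separated by whitespace and that P, F and C are the usual GO aspect codes for Biological Process, Molecular Function and Cellular Component.

```python
GO_CLASSES = {
    "P": "Biological Process",
    "F": "Molecular Function",
    "C": "Cellular Component",
}


def parse_groupsim_list(text):
    """Return {GO class letter: GROUPSIM filename} from the list file content."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        go_class, filename = line.split(None, 1)
        if go_class not in GO_CLASSES:
            raise ValueError("unknown GO class: %r" % go_class)
        mapping[go_class] = filename.strip()
    return mapping


listing = """P    GROUPSIM_0.5_SSM.txt
F    GROUPSIM_0.3_SSM.txt
C    GROUPSIM_0.6_SSM.txt
"""
groupsim_files = parse_groupsim_list(listing)
for go_class, filename in groupsim_files.items():
    print(GO_CLASSES[go_class], "->", filename)
```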

This software generates two kinds of flat files: one with statistics about the results, and one with the results for each species.

The STAT file has the following name structure:

STAT_cocannot_coloc_clusters_SIMvalues_x.xx.txt (where SIM values are the group similarity values for each class used during the analysis).

This file contains the following information for each species:

  • how and when the results were obtained
  • number of annotated genes and number of associated GO terms for each species, class and chromosome
  • number of 1-n similarities for each class and species
  • number of co-annotated genes for each GROUP_SIM, chromosome and species.

RESULT files have the following name structure: tag_duplicate_specie_filtered_results_threshold_SIMvalues_x.xx.txt (where SIM values are the group similarity values for each class used during the analysis, and x.xx is the adjusted p-value threshold).

Result files contain the following information, with a header line:

  • species: analyzed species
  • class: GO class
  • chr: chromosome
  • cluster: unique cluster ID (chromosome_symbol+GroupSIM_ID\start\end)
  • ID_gene: unique gene combination cluster
  • accessor GO: accessing GO for this Group SIM
  • list_GO: list of GO terms used to get these genes (* if not used)
  • list_genes: list of co-annotated genes within the cluster
  • IC: Information Content mean of the Group SIM
  • pvalue: raw p-value
  • adj_pvalue: adjusted Benjamini-Hochberg (1995) p-value
  • start: start of the first gene in the cluster (in bp)
  • percent_dup: percentage of duplicated genes inside cluster
  • prop_dup: proportion of duplicated genes inside cluster (nb_dup // nb_total)
  • length: length of the cluster (in bp)
  • lengthKB: length of the cluster (in kb)
  • N: nb of genes within a chromosome
  • n: nb of selected genes within a cluster with a common GO term
  • k: nb of genes within a cluster
  • m: nb of genes within a chromosome with a common GO term
  • additional information:
  • (solo co-annotated genes): list of co-annotated genes not in any cluster
  • non co-annotated genes within the cluster: list of these genes
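
The four counts N, m, k and n have the shape of the inputs of a hypergeometric test, and adj_pvalue is a Benjamini-Hochberg adjustment. The Python sketch below shows how such values could be computed. Note that this document does not state which statistical test G3C.pl actually uses, so the hypergeometric upper tail is an assumption made for illustration. The numbers are made up.

```python
from math import comb


def hypergeom_upper_tail(N, m, k, n):
    """P(X >= n) when k genes are drawn among N, m of which carry the GO term."""
    total = comb(N, k)
    upper = min(k, m)
    favourable = sum(comb(m, i) * comb(N - m, k - i) for i in range(n, upper + 1))
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    count = len(pvalues)
    order = sorted(range(count), key=lambda i: pvalues[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * count / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up cluster: 5 genes, all 5 sharing a term carried by 5 of the 10 genes
# of the chromosome.
raw = hypergeom_upper_tail(N=10, m=5, k=5, n=5)
print(raw)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

With this convention, a small p-value means that the cluster holds more genes sharing the GO term than expected from the chromosome-wide frequency of that term.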

4. Results files analyses

Several post-analysis scripts have been developed: 1) to get annotation information about selected clusters, 2) to generate statistics on cluster sizes and on the distribution of the percentage of duplicated genes, and 3) to compute the co-expression of the genes of a cluster, or to find the clusters co-located within a QTL interval.

4.1 G3C_Analyse

This script retrieves useful information on a cluster and on each co-annotated gene within it, given the cluster ID.

Run G3C_Analyse.pl, like this:

G3C_Analyse.pl -r Result_file_from_G3C.pl

Then type the IDs of the clusters you want information on, one by one. When finished, type q to end the script.

This script displays the annotation information on the standard output (i.e. the terminal screen). The data are also logged into two files: ANALYZED_G3C_species (which contains all the annotation information) and GENES_ANALYZED_G3C_species (which contains the list of gene IDs (i.e. Ensembl IDs) that have been displayed).

NB: These files will be deleted each time you run the script.

4.2 G3C_keyword

This script finds all clusters for which at least one GO term description contains one of the chosen keywords. It retrieves the genomic location of either these clusters or each gene in these clusters. By default, it generates genome-wide results (each genomic location is proportionally recalculated as if all genes were on the same chromosome). You can also run it to get results for each chromosome. To run G3C_keyword, use the following command:

G3C_keyword.pl -r G3C_results_with_group_genes_ID_file -c Chromosome_length_file -k keyword1 -k keyword2 -k keyword3...

Chromosome_length_file must have two tab-separated columns: the first contains the chromosome number or letter and the second contains the chromosome size in bp.
A keyword can be just the beginning of a word: for example, ‘phospholip’ will retrieve both phospholipase and phospholipid. You can use several keywords to fetch all the genes of a biological function (e.g. apoptosis, death, …): in this case, all the data will be merged in one file.
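
The sketch below illustrates, in Python, the two ideas described above: matching a keyword at the beginning of a word, and rescaling a position by the chromosome size so that locations from different chromosomes become comparable. It is not the code of G3C_keyword.pl. Case-insensitive matching and the rescaling as a simple fraction of the chromosome length are both assumptions, and the chromosome sizes are made up.

```python
import re


def matches_keyword(description, keywords):
    """True if any keyword starts a word of the GO term description."""
    return any(
        re.search(r"\b" + re.escape(keyword), description, re.IGNORECASE)
        for keyword in keywords
    )


def read_chromosome_lengths(text):
    """Parse the two-column, tab-separated chromosome length file content."""
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        chromosome, size = line.split("\t")
        lengths[chromosome] = int(size)
    return lengths


def relative_position(chromosome, position, lengths):
    """Express a position in bp as a fraction of its chromosome length."""
    return position / lengths[chromosome]


lengths = read_chromosome_lengths("1\t200\nX\t100\n")  # made-up sizes
print(matches_keyword("phospholipase activity", ["phospholip"]))
print(relative_position("1", 50, lengths), relative_position("X", 50, lengths))
```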

[options]
-n name: choose the output filename; by default, the keywords used are concatenated.

-u: retrieve results for each chromosome separately; the output file name then ends with CHR (the default name ends with GW, for genome-wide). This option also adds a header with the chromosome size, the density and number of selected genes on the chromosome, and the percentage of selected genes within clusters.

It generates three files:

specie_name_GENES_(CHR or GW)
Contains the chromosomal location of the genes which have at least one of the chosen keywords in their associated GO term descriptions, within selected clusters (C) or not (S).

specie_name_CLUS_(CHR or GW)
Contains the chromosomal location of the selected clusters (median).

specie_name_GO_Desc_CLUS_(CHR or GW)
Contains full description of each GO term containing the chosen keyword for each cluster:
chr group_gene_ID chr_size cluster_median go_terms_descriptions

4.3 G3C_enrich_cluster_QTL

This script performs an enrichment test of the location of genomic clusters within QTL intervals. It uses only the annotation from the BP class. You must provide 3 files for this script:

  • results file from G3C
  • QTL location from QTLdb with this structure:
    CHR: chr name
    DB: database
    Type: type of location (i.e. QTL)
    Start: location start (in base)
    End: end location (in base)
    Interval: location interval
    Trait name: name of the trait
    QTL_ID: ID from QTLdb
    QTL_type: type of statistical significance
    P-value (=<): p-value of the test
    Trait Ontology: ontology of the trait
    Ontology SLIM: groups of ontologies
    PubMed_ID: pubmed ID of the article
  • the similarity group file used for the previous analysis of G3C (used for the BP class)

Run the software using the following command:

G3C_enrich_cluster_QTL.pl -g GROUPSIM -q QTL_LOCATION -r RESULT_CLUSTER_G3C
with the following options:
verbose : -v optional, verbose DEBUG mode (STDOUT)
GROUPSIM: -g [file] similarity group file
QTL     : -q [file] QTL location file
result  : -r [file] cluster result file (original)

The results are written on STDOUT. If you don’t wish to perform an enrichment test, you can use the script G3C_coloc_cluster_QTL.pl, which just indicates the G3C clusters that co-locate within a QTL region.
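
The co-location check can be pictured as an interval comparison on the same chromosome. The Python sketch below uses the field names of the result file (chr, cluster, start, length) and of the QTL file (CHR, Start, End, QTL_ID). It is not the code of G3C_coloc_cluster_QTL.pl, and it assumes that "within" means that the whole cluster lies inside the QTL interval. All values are made up.

```python
def cluster_within_qtl(cluster_start, cluster_length, qtl_start, qtl_end):
    """True if the whole cluster lies inside the QTL interval (assumption)."""
    cluster_end = cluster_start + cluster_length
    return qtl_start <= cluster_start and cluster_end <= qtl_end


def colocated(clusters, qtls):
    """Return (cluster ID, QTL_ID) pairs located on the same chromosome."""
    hits = []
    for cluster in clusters:
        for qtl in qtls:
            same_chromosome = cluster["chr"] == qtl["CHR"]
            if same_chromosome and cluster_within_qtl(
                cluster["start"], cluster["length"], qtl["Start"], qtl["End"]
            ):
                hits.append((cluster["cluster"], qtl["QTL_ID"]))
    return hits


# Made-up clusters and QTL intervals (positions in bp).
clusters = [
    {"cluster": "c1", "chr": "1", "start": 1000, "length": 500},
    {"cluster": "c2", "chr": "2", "start": 1000, "length": 500},
    {"cluster": "c3", "chr": "1", "start": 4800, "length": 500},
]
qtls = [{"QTL_ID": "q1", "CHR": "1", "Start": 500, "End": 5000}]
hits = colocated(clusters, qtls)
print(hits)
```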

4.4 G3C_coregul

This script computes a covariance matrix from the expression data of the genes inside a cluster. It also fetches data from the ensembl SQL table.

You must provide 3 files:

  • clusters genes list file (with no header) with the following structure:
    Ensembl_ID G3C_cluster_ID
  • expression data with the following structure:
    Ensembl_ID Expression_data (in a separate column)
  • results file from G3C

Run the software using this command line:

G3C_coregul.pl -c CLUSTER_GENE -f EXPRESSION_DATA -r RESULT_CLUSTER_G3C
cluster: -c [file] Clusters genes list file (no header)
file   : -f [file] Expression data file
result : -r [file] cluster result file (original)

The results are written on STDOUT.
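
For readers unfamiliar with the computation, here is a minimal Python sketch of a sample covariance matrix between the genes of one cluster. It is not the code of G3C_coregul.pl. It assumes that each gene has one expression value per sample and that all genes share the same samples. The identifiers and values are made up.

```python
def covariance_matrix(expression):
    """Return (gene order, sample covariance matrix) for {gene: [values]}."""
    genes = sorted(expression)
    n_samples = len(expression[genes[0]])
    means = {gene: sum(expression[gene]) / n_samples for gene in genes}
    matrix = []
    for gene_a in genes:
        row = []
        for gene_b in genes:
            products = (
                (x - means[gene_a]) * (y - means[gene_b])
                for x, y in zip(expression[gene_a], expression[gene_b])
            )
            row.append(sum(products) / (n_samples - 1))  # unbiased estimator
        matrix.append(row)
    return genes, matrix


# Made-up expression values for two genes of the same cluster, three samples.
expression = {
    "GENE_ID_1": [1.0, 2.0, 3.0],
    "GENE_ID_2": [2.0, 4.0, 6.0],
}
genes, matrix = covariance_matrix(expression)
print(genes)
for row in matrix:
    print(row)
```

The diagonal holds the variance of each gene, and the off-diagonal cells hold the covariance between two genes.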

4.5 G3C_generate_stat and G3C_generate_stat_IC

The G3C_generate_stat.pl script outputs, in 2 files, the genomic size and the percentage of duplicated genes for each unique cluster (i.e. unique combination of genes), for further statistical analyses. The G3C_generate_stat_IC.pl script outputs on the terminal the mean IC value (and its standard deviation) for each class. The commands to run these scripts are the following:

G3C_generate_stat.pl -r RESULT_CLUSTER_G3C
G3C_generate_stat_IC.pl -r RESULT_CLUSTER_G3C
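
The per-class summary can be sketched as follows in Python, using the class and IC fields of the result file. This is not the code of G3C_generate_stat_IC.pl. In particular, whether the script first reduces the rows to unique clusters is not stated here, so the sketch simply uses every row it is given. The values are made up.

```python
from collections import defaultdict
from statistics import mean, stdev


def ic_summary(rows):
    """Return {GO class: (mean IC, standard deviation of IC)}."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class"]].append(float(row["IC"]))
    return {
        go_class: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for go_class, values in by_class.items()
    }


# Made-up result rows reduced to the two fields used here.
rows = [
    {"class": "P", "IC": "1.0"},
    {"class": "P", "IC": "3.0"},
    {"class": "F", "IC": "2.5"},
]
summary = ic_summary(rows)
for go_class, (average, deviation) in summary.items():
    print(go_class, average, deviation)
```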

5. Annexes: SQL dump of the G3C database structure

This section contains a SQL dump of the structure of the G3C database. You can copy and paste these SQL statements to create your own database.

-- --------------------------------------------------------

--
-- Table structure for `ensembl`
--

CREATE TABLE IF NOT EXISTS `ensembl` (
`ID` varchar(40) default NULL,
`name` varchar(50) default NULL,
`description` longtext,
`strand` varchar(10) default NULL,
`chr` varchar(40) default NULL,
`start` varchar(60) default NULL,
`end` varchar(60) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for `go`
--

CREATE TABLE IF NOT EXISTS `go` (
`goid` varchar(40) default NULL,
`goclass` varchar(3) default NULL,
`description` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for `goa`
--

CREATE TABLE IF NOT EXISTS `goa` (
`db` varchar(40) default NULL,
`db_id` varchar(60) default NULL,
`db_symbol` varchar(60) default NULL,
`go_id` varchar(20) default NULL,
`go_ref` varchar(50) default NULL,
`evidence` varchar(10) default NULL,
`go_class` varchar(3) default NULL,
`name` varchar(200) default NULL,
`goa_species` varchar(60) default NULL,
`IPI_id` varchar(20) default NULL,
KEY `db_id` (`db_id`,`db_symbol`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COMMENT='Import of the EBI GOA files for cow, chicken, human, rat & mouse';

-- --------------------------------------------------------

--
-- Table structure for `GoLoc`
--

CREATE TABLE IF NOT EXISTS `GoLoc` (
`species` varchar(60) default NULL,
`go_class` varchar(10) default NULL,
`go_id` varchar(20) default NULL,
`symbol` varchar(60) default NULL,
`IPI_id` varchar(20) default NULL,
`chr` varchar(10) default NULL,
`start` varchar(20) default NULL,
`end` varchar(20) default NULL,
`strand` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;