SGN Gene Family Analysis

Summary 

SGN gene family analysis groups proteins by sequence similarity. It incorporates the Arabidopsis proteome, peptides predicted from SGN unigenes (currently from Lycopersicon combined, Solanum tuberosum, Solanum melongena, Capsicum annuum and Petunia hybrida), and peptides predicted from coffee unigenes.

Procedure 
1. SGN and coffee unigenes are run through ESTScan, an HMM-based program that predicts coding regions and the corresponding peptides from EST sequences[1].
2. Predicted SGN and coffee peptides are combined with Arabidopsis predicted proteins, and a self blastp (all sequences against all sequences) is performed on the combined protein data set[2].
3. The TRIBE-MCL program is applied to the blastp results to cluster the protein sequences into families[3]. The program first translates the blastp results into a similarity matrix, then groups the proteins from that matrix using the Markov cluster (MCL) algorithm.
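
The idea behind step 3 can be sketched in a few lines of Python. This is a simplified illustration, not the TRIBE-MCL implementation: it assumes the blastp hits have already been reduced to (query, subject, E-value) triples, it uses -log10(E-value) as the similarity weight (an assumption made here for illustration), and all function names and protein IDs are hypothetical.

```python
import math
from collections import defaultdict


def similarity_matrix(hits):
    """Turn (query, subject, evalue) triples into a symmetric similarity matrix."""
    ids = sorted({name for q, s, _ in hits for name in (q, s)})
    index = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    m = [[0.0] * n for _ in range(n)]
    for q, s, evalue in hits:
        # Assumed weighting: -log10(E-value), clamped to the range [0, 200].
        weight = 200.0 if evalue <= 0 else min(200.0, max(0.0, -math.log10(evalue)))
        i, j = index[q], index[s]
        m[i][j] = m[j][i] = max(m[i][j], weight)
    for i in range(n):
        # Self-loop at least as strong as the best hit of this protein.
        m[i][i] = max(max(m[i]), 1.0)
    return ids, m


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / totals[j] if totals[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation: raise each entry to the given power, then renormalize columns."""
    return normalize_columns([[value ** inflation for value in row] for row in m])


def mcl_families(hits, inflation=2.0, iterations=50, eps=1e-6):
    """Alternate expansion and inflation, then read families off the final matrix."""
    ids, m = similarity_matrix(hits)
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)

    # Proteins that still share flow above eps belong to the same family.
    parent = list(range(len(ids)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, row in enumerate(m):
        for j, value in enumerate(row):
            if value > eps:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, name in enumerate(ids):
        groups[find(i)].append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical hits: two tight groups joined by one weak hit.
    demo_hits = [
        ("A1", "A2", 1e-80), ("A1", "A3", 1e-80), ("A2", "A3", 1e-80),
        ("B1", "B2", 1e-80), ("A3", "B1", 1e-2),
    ]
    print(mcl_families(demo_hits, inflation=2.0))
```

In this toy input the weak A3/B1 hit carries very little flow compared with the strong hits inside each group, so it is expected to vanish during the inflation steps. A production run would use the actual TRIBE-MCL or MCL software rather than dense pure-Python matrices, which do not scale to a full proteome.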
Terms 
Data Set: A combination of Arabidopsis predicted proteins and peptides predicted from the current SGN and coffee unigene builds. If any member of the data set is updated (for example, a new Solanum tuberosum unigene build is released), a new data set is generated and the family analysis is rerun on it.
i Value: TRIBE-MCL clusters proteins by alternating two operators called expansion and inflation. While inflation groups genes into clusters, expansion dissipates clusters. The i value controls the stringency of inflation: the higher the i value, the more stringent the inflation operator is when grouping genes together, which generally yields more, smaller families.
Family Build: A family build is uniquely defined by the Data Set and the stringency (i Value). For each SGN Data Set, we run the TRIBE-MCL analysis with three i Values (1.2, 2 and 5) and obtain three Family Builds.
Gene Family of a Species: A gene family with at least one member gene from the species.
Unique Gene Family of a Species: A gene family whose member genes all come from that species exclusively.
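
The last two terms can be made concrete with a short Python sketch. It assumes a family build is available as a mapping from family ID to a list of (gene, species) pairs; that layout, the function name and the IDs in the usage note are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def families_by_species(families):
    """Classify families per species.

    families: dict mapping family ID -> list of (gene_id, species) pairs.
    Returns two dicts keyed by species:
      of_species[species]: families with at least one member from the species
      unique_to[species]:  families whose members all come from the species
    """
    of_species = defaultdict(set)
    unique_to = defaultdict(set)
    for family_id, members in families.items():
        species_present = {species for _, species in members}
        for species in species_present:
            of_species[species].add(family_id)
        if len(species_present) == 1:
            # Every member gene comes from the same single species.
            unique_to[next(iter(species_present))].add(family_id)
    return of_species, unique_to
```

For example, with a family F1 that holds one Arabidopsis gene and one tomato gene, and a family F2 that holds only tomato genes, both F1 and F2 are gene families of tomato, but only F2 is a unique gene family of tomato.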
References 
[1] Iseli C. et al (1999), ESTScan: a Program for Detecting, Evaluating and Reconstructing Potential Coding Regions in EST Sequences. American Association for Artificial Intelligence.
[2] Altschul S.F. et al (1997), Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
[3] Enright A.J. et al (2002), An Efficient Algorithm for Large-Scale Detection of Protein Families. Nucleic Acids Research 30, 1575-1584.