Assembly Process Validation
In an effort to validate SGN's unigene assembly
process, we compared our combined Lycopersicon build
with TIGR's tomato gene index. These comparisons are
based on the latest TIGR tomato gene index available at
the time, published on June 1, 2002. Note that neither
SGN's unigene build nor TIGR's gene index build is
supported by experimental evidence, and thus both remain
approximations of the true nature of the genomes
represented.
Due to differences in input data, such as EST
sequences not common to both builds, and differences in
chromatogram processing, direct comparison of the two
builds exposes mostly "noisy" differences, and attempts
to characterize or manually curate those differences
give inconclusive results.
Thus, the data presented below serves to indicate the
observed similarity between the builds and to demonstrate
that neither build differs from the other in a way that
would indicate a suspicious assembly process. See this
page for a discussion of the assembly process.
| | SGN Lycopersicon combined build #1 | TIGR Tomato Gene Index |
| Total # of output sequences | 31278 | 31102 |
| Contigs (TCs) | 16200 | 15211 |
| Singlets | 15078 | 15891 |
| Censored inputs | 14310 | 11054 |
| Exclusive Contigs | 0 | 0 |
| Exclusive Singlets | 2044 | 707 |
Contigs are unigenes or gene index
sequences composed of the consensus of an
alignment of two or more EST sequences.
Singlets are sequences which have been
determined not to overlap sufficiently with any other
sequence in the input data set. Censored
inputs are input sequences which are not common
to both builds. Exclusive contigs are
contigs composed entirely of input sequences which are
not common to both builds. Exclusive
singlets are singlets found only in the
indicated build. Since no exclusive contigs were found,
every contig in SGN's build and every TC in TIGR's
tomato gene index contains at least one input sequence
common to both builds.
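The definitions above can be expressed as set operations. The following is a minimal sketch, not the procedure SGN actually used: it assumes each build is available as a mapping from an output sequence ID to the set of its member EST IDs, and it interprets an exclusive singlet as a singlet whose only member is not a common input. The names `Build`, `all_inputs`, and `summarize`, and the toy IDs, are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: output sequence ID -> member EST IDs.
Build = Dict[str, Set[str]]


def all_inputs(build: Build) -> Set[str]:
    """Union of every input sequence used anywhere in a build."""
    return set().union(*build.values())


def summarize(build: Build, other: Build) -> Dict[str, int]:
    """Count the categories defined above for `build` relative to `other`."""
    common = all_inputs(build) & all_inputs(other)
    contigs = [m for m in build.values() if len(m) >= 2]
    singlets = [m for m in build.values() if len(m) == 1]
    return {
        "outputs": len(build),
        "contigs": len(contigs),
        "singlets": len(singlets),
        # Inputs of this build that the other build does not use.
        "censored_inputs": len(all_inputs(build) - common),
        # Contigs with no member in the common input set.
        "exclusive_contigs": sum(1 for m in contigs if not (m & common)),
        # Assumption: a singlet is exclusive when its one member is not common.
        "exclusive_singlets": sum(1 for m in singlets if not (m & common)),
    }


# Toy data, not taken from either real build.
sgn_toy: Build = {"U1": {"a", "b"}, "U2": {"c"}, "U3": {"x"}}
tigr_toy: Build = {"TC1": {"a", "b", "c"}, "T2": {"y"}, "T3": {"z"}}

sgn_summary = summarize(sgn_toy, tigr_toy)
tigr_summary = summarize(tigr_toy, sgn_toy)
```

Under this interpretation, a count of zero exclusive contigs means every contig shares at least one member with the other build's inputs, which is the property noted above.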
After normalizing the unigene membership data to
compare solely in terms of input sequences common to both
builds, we find:
| | SGN | TIGR |
| Total # of output sequences | 29234 | 30395 |
| Contigs (TCs) | 15034 | 14432 |
| Singlets | 14200 | 15963 |
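The normalization step that produced the table above might look like the following sketch. It assumes the same hypothetical representation of a build as a mapping from output sequence ID to member EST IDs; the function names and toy data are illustrative and not taken from SGN's pipeline.

```python
from typing import Dict, Set

# Hypothetical representation: output sequence ID -> member EST IDs.
Build = Dict[str, Set[str]]


def normalize(build: Build, common: Set[str]) -> Build:
    """Restrict every output sequence to common inputs; drop those left empty."""
    restricted = {seq_id: members & common for seq_id, members in build.items()}
    return {seq_id: members for seq_id, members in restricted.items() if members}


def count_types(build: Build) -> Dict[str, int]:
    """Count contigs (two or more members) and singlets (exactly one member)."""
    return {
        "outputs": len(build),
        "contigs": sum(1 for m in build.values() if len(m) >= 2),
        "singlets": sum(1 for m in build.values() if len(m) == 1),
    }


# Toy data: "x" and "y" stand in for inputs missing from the other build.
toy: Build = {"U1": {"a", "b"}, "U2": {"c", "x"}, "U3": {"y"}}
common_inputs = {"a", "b", "c"}

before = count_types(toy)
after = count_types(normalize(toy, common_inputs))
```

In the toy data, U3 disappears because its only member is not common, while U2 loses a member and is recounted as a singlet.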
Because the input sequences have been normalized to a
common set at this point, and output sequences derived
exclusively from non-common sequences have been removed
from consideration, this data suggests that SGN's
assembly process is slightly more lenient, allowing the
assembly of more sequences into contigs. We find that
74.5% of the SGN unigene build is identical to TIGR's
gene index. Most of the remaining differences are cases
where a contig in SGN is represented in TIGR as one
contig and one or more singlets, or vice versa.
Investigation of these cases is consistent with the claim
above that SGN's build is biased slightly toward
inclusion of sequences into contigs. Although the first
table indicates that 2044 singlets are exclusive to SGN,
the number of singlets has not dropped by 2044 because
some contigs have become singlets after censoring
non-common input sequences from consideration. The same
is true for TIGR's build.
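Two points in the paragraph above can be illustrated in code. The sketch below first shows one plausible way to measure the identical fraction, assuming "identical" means that an output sequence has exactly the same member set in both normalized builds (the source does not state the criterion). It then checks the singlet accounting using the SGN columns of the two tables. The function and variable names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: output sequence ID -> member EST IDs.
Build = Dict[str, Set[str]]


def identical_fraction(build: Build, other: Build) -> float:
    """Fraction of outputs in `build` whose exact member set occurs in `other`."""
    if not build:
        return 0.0
    other_sets = {frozenset(members) for members in other.values()}
    same = sum(1 for members in build.values() if frozenset(members) in other_sets)
    return same / len(build)


# Toy data, not taken from either real build.
toy_sgn: Build = {"U1": {"a", "b"}, "U2": {"c"}}
toy_tigr: Build = {"T1": {"a", "b"}, "T2": {"c", "d"}}
toy_fraction = identical_fraction(toy_sgn, toy_tigr)

# Singlet accounting with the SGN figures from the two tables.
sgn_before = {"contigs": 16200, "singlets": 15078}
sgn_after = {"contigs": 15034, "singlets": 14200}
exclusive_singlets = 2044

# With zero exclusive contigs, no contig can vanish entirely during
# censoring, so every contig lost from the count must have become a singlet.
contigs_to_singlets = sgn_before["contigs"] - sgn_after["contigs"]
expected_singlets = (
    sgn_before["singlets"] - exclusive_singlets + contigs_to_singlets
)
```

The accounting reproduces the 14200 singlets in the normalized SGN column: removing the 2044 exclusive singlets is partly offset by the contigs that were reduced to a single common member.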
Since the Lycopersicon combined build and TIGR's
tomato gene index contain data from three different
Lycopersicon species, it is useful to look at the number
of unigenes specific to Lycopersicon hirsutum and
Lycopersicon pennellii, which ought to show
substantial allelic variation relative to the species
dominantly represented in the input data, Lycopersicon
esculentum.
| | SGN | TIGR |
| hirsutum specific contigs | 94 | 157 |
| pennellii specific contigs | 147 | 113 |
| hirsutum/esculentum mixed contigs | 1908 | 1863 |
| pennellii/esculentum mixed contigs | 6552 | 6624 |
This data suggests that both TIGR's and SGN's assembly
processes allow sequences that differ by small
evolutionary divergence, as well as by sequencing
errors, to be assembled into the same contig. It is not
clear from this data whether orthologs are specifically
isolated in the assembly. Neither assembly process at
this time contains specific steps for separating
orthologs from paralogs in cross-species assemblies.
This question cannot be completely settled in silico.
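The species breakdown in the table above could be tallied along the following lines. This is a sketch under assumptions: it supposes that the source species of each EST is known and that a contig is labeled by the set of species among its members. The label strings, function names, and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical representation: output sequence ID -> member EST IDs.
Build = Dict[str, Set[str]]


def classify_contig(members: Set[str], species_of: Dict[str, str]) -> str:
    """Label a contig by the species of its member ESTs."""
    species = sorted({species_of[est] for est in members})
    if len(species) == 1:
        return f"{species[0]} specific"
    return "/".join(species) + " mixed"


def tally_contigs(build: Build, species_of: Dict[str, str]) -> Counter:
    """Count contig labels, skipping singlets (fewer than two members)."""
    return Counter(
        classify_contig(members, species_of)
        for members in build.values()
        if len(members) >= 2
    )


# Toy data, not taken from either real build.
toy_species = {
    "a": "esculentum", "b": "esculentum", "g": "esculentum",
    "c": "hirsutum", "d": "hirsutum", "f": "hirsutum",
    "e": "pennellii",
}
toy_build: Build = {
    "U1": {"a", "c"},   # mixed
    "U2": {"d", "f"},   # single species
    "U3": {"e"},        # singlet, not counted
    "U4": {"b", "g"},   # single species
}
toy_tally = tally_contigs(toy_build, toy_species)
```

A tally like this only shows that ESTs from different species were merged into one contig; it cannot by itself distinguish orthologs from paralogs.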
In conclusion, comparing TIGR's gene index with SGN's
Lycopersicon combined unigene build indicates that each
procedure confirms the predictions of the other in most
cases. Differences are observed, but most are
attributable to differences in the inputs to the two
processes. The reader is reminded that the above data
attempts to characterize the differences in the outputs
of two separate processes without being able to control
for the differences in their inputs. Thus, the
conclusive power of the analysis is limited.