SGN Assembly Process Version 2
ESTs are preclustered with a custom-developed tool that
coarsely identifies strong sequence overlaps. (Why precluster?) This
produces a set of pairwise scores, which are used for transitive
closure clustering, implemented as a graph algorithm based on
depth-first search.
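As a minimal sketch of the transitive closure clustering described above, the following Python treats each sequence as a graph node and each detected overlap as an undirected edge, then collects clusters by depth-first search. The function names, the sequence identifiers, and the score values are hypothetical illustrations, not part of the SGN pipeline.

```python
from collections import defaultdict


def build_graph(overlaps):
    """Build an undirected adjacency map from (seq_a, seq_b, score) triples."""
    graph = defaultdict(dict)
    for a, b, score in overlaps:
        graph[a][b] = score
        graph[b][a] = score
    return graph


def connected_components(graph, nodes):
    """Return clusters (sets of sequence ids) found by depth-first search."""
    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        # Each unvisited start node roots one tree of the depth-first forest.
        cluster = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.add(node)
            for neighbor in graph.get(node, {}):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(cluster)
    return clusters


# Hypothetical pairwise scores from preclustering.
overlaps = [("est1", "est2", 0.97), ("est2", "est3", 0.95), ("est4", "est5", 0.92)]
nodes = ["est1", "est2", "est3", "est4", "est5", "est6"]
clusters = connected_components(build_graph(overlaps), nodes)
```

Here est1 and est3 land in the same cluster even though they share no direct overlap, which is the transitive closure at work; est6 has no overlaps and forms a cluster of its own.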
In graph-theoretic terms, the sequences are the nodes of a
graph. An undirected edge between two nodes indicates a
detected overlap between the sequences they represent.
Edges may be weighted to indicate the strength of the
overlap. The connected components of the graph are
discovered by depth-first search, yielding a depth-first
"forest" of sequence clusters.
Articulation points in the graph are discovered by
analyzing the "tree edge" and "back edge" classification
of edges from the depth-first search. Nodes identified as
articulation points are potentially chimeric sequences,
and their overlaps are analyzed further for adjacent but
distinct homology regions. Sequences with such regions
are considered likely to be chimeric and are discarded.
Because the sequence is an articulation point, discarding
it breaks the cluster into at least two separate
clusters, as expected.
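The articulation point search can be sketched with the standard depth-first "low-link" technique, which uses exactly the tree edge and back edge classification mentioned above. This is an illustrative sketch, not the SGN implementation: the function name and the example graph are hypothetical, the graph is assumed to have no duplicate edges, and the recursive form would need an iterative rewrite (or a raised recursion limit) for very large clusters.

```python
def articulation_points(graph):
    """Return the set of articulation points of an undirected graph.

    graph maps each node to a list of its neighbors.
    """
    disc = {}      # discovery order of each node in the depth-first search
    low = {}       # earliest discovery order reachable from the node's subtree
    points = set()
    counter = [0]

    def visit(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in graph[node]:
            if neighbor == parent:
                continue
            if neighbor in disc:
                # Back edge: reaches an already discovered node.
                low[node] = min(low[node], disc[neighbor])
            else:
                # Tree edge: neighbor is discovered through this node.
                children += 1
                visit(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # The subtree below neighbor cannot bypass node.
                if parent is not None and low[neighbor] >= disc[node]:
                    points.add(node)
        # A root is an articulation point only if it has several subtrees.
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in disc:
            visit(node, None)
    return points


# Hypothetical cluster: two tightly overlapping groups joined only through "x".
graph = {
    "a": ["b", "x"],
    "b": ["a", "x"],
    "x": ["a", "b", "c", "d"],
    "c": ["x", "d"],
    "d": ["x", "c"],
}
candidates = articulation_points(graph)
```

In this example only "x" is reported, so it would be the one sequence whose overlaps are examined for adjacent but distinct homology regions.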
The resulting clusters are supplied as input, together
with base-calling quality scores, to the CAP3 assembly
program. We have used the following parameters (for the
Lycopersicon combined build):
| CAP3 option | default value | value used | description |
| -e | 30 | 5000 | "extra" number of observed differences |
| -s | 900 | 401 | minimum similarity score for an overlap |
| -p | 75 | 90 | percent identity required for overlap |
| -d | 200 | 10000 | maximum allowed sum of quality scores of mismatched bases in overlaps |
| -b | 20 | 60 | quality score threshold for scoring a base mismatch |
Please see the documentation for CAP3 for further
information on other parameters (which are left to
default values) and complete descriptions of the
above.
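For illustration, the option values listed above could be passed to CAP3 from a small Python wrapper like the one below. This is a sketch under assumptions: the executable is assumed to be installed as "cap3" on the search path, the file name cluster_0001.fasta is hypothetical, and the quality scores are assumed to sit in a companion file named after the reads file with a ".qual" suffix, which is how CAP3 is generally documented to find them. Check these points against the documentation for your CAP3 version.

```python
import subprocess

# Option values used for the Lycopersicon combined build (from the table above).
CAP3_OPTIONS = {"-e": 5000, "-s": 401, "-p": 90, "-d": 10000, "-b": 60}


def cap3_command(reads_fasta, options=CAP3_OPTIONS, executable="cap3"):
    """Build the argument list for one CAP3 run on one cluster's reads."""
    command = [executable, reads_fasta]
    for flag, value in options.items():
        command += [flag, str(value)]
    return command


def run_cap3(reads_fasta, stdout_path):
    """Run CAP3 on one cluster and save its standard output to a file."""
    with open(stdout_path, "w") as out:
        subprocess.run(cap3_command(reads_fasta), stdout=out, check=True)


# Hypothetical cluster file; a matching cluster_0001.fasta.qual is assumed.
command = cap3_command("cluster_0001.fasta")
```

Building the argument list separately from running it makes the exact command easy to log or inspect before any assembly is started.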
The point here is to restrict or eliminate the effect
of the "-e", "-s", "-d", and "-b" options, leaving "-p" in
the driver's seat. This makes each decision to assemble or
not assemble easily interpretable. The other parameters
are attempts to introduce more sensitive discrimination
than the percent identity of a detected overlap alone.
However, our experience has shown that these parameters,
at default or similar settings, yield arbitrary
assemblies and dominate over the most intuitive measure,
the percent identity in an overlap. Preliminary
experiments indicate that "-p" is the most useful option
for controlling CAP3's behavior, but its effects are
noticeable only when the other overlap assessment
options are effectively disabled.
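To make the "percent identity in an overlap" criterion concrete, here is a simplified sketch of the kind of decision that remains once the other options are neutralized. It is not CAP3's internal calculation: the function names are hypothetical, the overlap is assumed to be given as two already aligned strings of equal length with "-" for gaps, and gap columns are counted as mismatches.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def accept_overlap(aligned_a, aligned_b, min_identity=90.0):
    """Accept the overlap when identity reaches the threshold (compare "-p" 90)."""
    return percent_identity(aligned_a, aligned_b) >= min_identity


# Toy overlaps of 10 columns: one mismatch, then two mismatches.
one_mismatch = accept_overlap("ACGTACGTAC", "ACGTACGTAT")
two_mismatches = accept_overlap("ACGTACGTAC", "ACGAACGTAT")
```

With a threshold of 90, the first toy overlap (9 of 10 columns identical) is accepted and the second (8 of 10) is rejected, which is the kind of easily interpretable outcome the text argues for.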