Preclustering
Preclustering is a technique used to partition the
input data into groups small enough for the assembly
program to process. Even on powerful computers, an
assembler such as CAP3 or PHRAP can not effectively run
with more than 20,000 input sequences. Either the memory
requirements are too large or the runtime is unacceptably
large.
By preclustering, we reduce the input size into
disjoint groups of sequences which are not at all similar
to any of the sequences in other groups. This limits the
work on the assembler by excluding sequences which are
obviously not transcripts from the same gene. Thus, the
assembly program is used to decide (and assembly into
contigs) the number of unique transcripts in a "cluster"
of similar ESTs, preclustering is used to partition the
input set into disjoint clusters of similar sequences
which are small enough to allow the assembler to run
efficiently.
Transitive Closure Clustering
There are many general methods of clustering data. For
purposes of partitioning data into disjoint sets for
unigene assembly, we use a simple method which we call
"transitive closure clustering." The same methodology has
been described elsewhere as "single-linkage
clustering."
Pairwise scores are found for all pairs of sequences.
If the score for a pair of sequences is higher than some
given threshold, the pair is considered linked. If A is
linked to B, and B is linked to C, then A, B, and C are
clustered together, even if A is not considered linked
with C. Hence, the linkage relationship is transitive,
and a cluster is found by finding the transitive closure
of the linkage relationship.
In context of unigene assembly, this effectively
yields disjoint clusters of sequences for which no
sequence in a given cluster has a detectable coarse
overlap with any sequence in any other cluster. Thus,
there is no possibility for contig assembly of two
sequences which are in different clusters, so the
exclusion does not in theory alter the outcome of the
assembly step. Since the preclustering pairwise
comparisons are much more efficient coarse approximations
than the assembler's full alignments, the overall runtime
and resource consumption of the unigene build becomes
manageable.
|