Unigene Assembly Process Overview
The "unigene problem" consists of two fundamental questions:
The ability to answer either question correctly and consistently would enable an algorithm to assemble a unigene build precisely from EST sequences. If the answer to (1) is yes, then the answers to (2) are known: the errors are where the sequences differ, barring allelic variation. Likewise, if (2) were determined, then (1) could be settled easily by examining an alignment of the sequences for true differences in the overlapping region.
Constructing a unigene build must attempt to solve both questions simultaneously. This is different from genomic DNA assembly, for the following important reasons:
There are no widely used, freely available assemblers for EST data, so we do the next best thing: use a genomic assembler such as phrap (P. Green) or CAP3 (X. Huang). CAP3 is typically preferred for EST assembly (see  for a discussion) because it is less aggressive at splitting contigs apart.
In general, deciding whether to assemble two sequences together is easy as long as the observed differences between them are significant. When the observed differences between two sequences approach the rate of sequencing error, determining whether the sequences represent two different genes becomes theoretically impossible without collecting more data. Because error rates across a collection of sequences form a distribution, there is a range of observed differences in which actual differences and sequencing errors cannot be told apart, and assembly decisions become arbitrary.
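To make the point about differences approaching the error rate concrete, here is a minimal sketch in Python. It is not part of SGN's pipeline. It assumes a simple model that the text does not specify: each read has an independent per-base error rate, and the number of mismatches in an overlap follows a binomial distribution. The function name and the example numbers are hypothetical.

```python
from math import comb


def mismatch_tail_probability(overlap_len, mismatches, error_rate):
    """Probability of seeing at least `mismatches` differences in an overlap
    of `overlap_len` bases when both reads come from the same gene.

    Assumption (not from the source): errors are independent, and a position
    mismatches whenever at least one of the two reads is wrong there. This
    slightly overestimates mismatches, since two errors can coincide.
    """
    # Per-position mismatch probability for two reads of the same gene.
    p = 1.0 - (1.0 - error_rate) ** 2
    # Binomial probability of observing fewer than `mismatches` differences.
    below = sum(
        comb(overlap_len, k) * p**k * (1.0 - p) ** (overlap_len - k)
        for k in range(mismatches)
    )
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Hypothetical overlap of 400 bases with a 1% per-read error rate.
    for observed in (2, 8, 15, 40):
        prob = mismatch_tail_probability(400, observed, 0.01)
        print(f"{observed:>3} mismatches: P(at least this many by error alone) = {prob:.3g}")
```

Under this model, a mismatch count with a very small probability points to a real difference between the sequences, while a count near the expected number of errors is consistent with both explanations. That middle range is where the decision becomes arbitrary.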
The likely result is the over-representation or under-representation of gene families that contain recently diverged paralogs. Additionally, if the sequenced organism is heterozygous at many loci with significant allelic variation, similar results may occur.
This may be controlled by the selection of threshold parameters governing the assembly process, but there is no "one size fits all" threshold that accurately decides all cases. Particular choices of thresholds may (a) promote false detection of distinct but similar genes, (b) promote false detection of alleles (by assembling close paralogs together), or (c) do both (a neutral choice of parameters). For SGN's assembly, we have decided to proceed with option (b), in an attempt to minimize the number of false isolations of unique transcripts.
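The trade-off between options (a), (b), and (c) can be illustrated with a small Python sketch. It is only an illustration, not SGN's procedure. The parameter `min_identity` is a hypothetical stand-in for an assembler's overlap identity cutoff, and the pairs in `toy_pairs` are invented values, not measurements.

```python
def assembly_errors(pairs, min_identity):
    """Count both kinds of wrong decision at a given identity threshold.

    `pairs` holds (percent_identity, same_gene) tuples, where `same_gene`
    is the true answer. Two sequences are merged when their identity is
    at least `min_identity`.
    Returns (false_splits, false_merges).
    """
    false_splits = 0  # same gene, kept apart: case (a) in the text
    false_merges = 0  # different genes, assembled together: case (b)
    for identity, same_gene in pairs:
        merged = identity >= min_identity
        if same_gene and not merged:
            false_splits += 1
        elif merged and not same_gene:
            false_merges += 1
    return false_splits, false_merges


# Invented example: reads of one gene with sequencing errors (True)
# and reads of close paralogs (False) with overlapping identity ranges.
toy_pairs = [
    (99.5, True), (98.0, True), (96.5, True),
    (97.5, False), (95.0, False), (90.0, False),
]

if __name__ == "__main__":
    for threshold in (99.0, 97.0, 95.0):
        splits, merges = assembly_errors(toy_pairs, threshold)
        print(f"min_identity={threshold}: false splits={splits}, false merges={merges}")
```

Because the identity ranges of the two groups overlap in this toy data, no threshold removes both kinds of error. A strict cutoff yields only false splits, a lenient cutoff yields only false merges, and an intermediate cutoff yields some of each.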
Future versions of SGN's unigene build process will include the option for the user to inspect an assembly's multiple sequence alignment (MSA) as well as view the major alternatives incorporated in any given assembly.