Unigene Assembly Process Overview

The "unigene problem" consists of two fundamental questions:

  1. Are these two sequences from the same gene/transcript?
  2. Where are the sequencing errors in this sequence?

The ability to answer either question correctly and consistently would allow precise assembly of a unigene build from EST sequences. If the answer to (1) is yes, then the answers to (2) are known: errors are where the sequences differ, barring allelic variation. Conversely, if (2) were determined, (1) could be settled by examining an alignment of the sequences for true differences in the overlapping region.
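
As an illustration of how the two questions feed each other, the sketch below lists the columns where two already-aligned sequences disagree within their overlap. The function name, the use of "." to mark columns a read does not cover, and the example sequences are all assumptions made for this illustration; they are not part of SGN's pipeline.

```python
def overlap_differences(row_a, row_b, uncovered="."):
    """Compare two rows of a pairwise alignment.

    Both rows must have the same length. Columns where either row holds
    the `uncovered` character lie outside the overlap and are skipped.
    Returns (overlap_length, list_of_mismatching_columns).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    overlap = 0
    mismatches = []
    for col, (x, y) in enumerate(zip(row_a, row_b)):
        if x == uncovered or y == uncovered:
            continue  # only one sequence covers this column
        overlap += 1
        if x.upper() != y.upper():
            mismatches.append(col)
    return overlap, mismatches


# Hypothetical pair of ESTs sharing an 8-column overlap.
est_a = "ACGTTGCAAGGCT....."
est_b = ".....GCATGGCTTACGA"
overlap, diffs = overlap_differences(est_a, est_b)
print(overlap, diffs)
```

If the two reads are known to come from the same transcript, each reported column is a candidate sequencing error (or allelic difference). If instead the reads are known to be error-free, any reported column is evidence that they come from different transcripts.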

Constructing a unigene build requires attempting to answer both questions simultaneously. This differs from genomic DNA assembly for the following important reasons:

  1. EST sequencing methodology does not yield an expectation of stochastic oversampling of each DNA base. In genomic sequencing, with 8X expected coverage for example, answering (2) above becomes easier as there are several observations for each base once proper alignment is determined.

  2. The optimal outcome of assembling a BAC is exactly one contig. The implied answer to question (1) above is then always yes: all subclones belong in the same contig.
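
To put a number on the 8X example in point 1, the sketch below assumes an idealized model in which reads land uniformly at random, so that the depth at a given base is approximately Poisson-distributed with mean equal to the expected coverage. That model is an assumption introduced here for illustration. No comparable model applies to ESTs, where depth follows transcript abundance rather than a sampling design.

```python
import math


def prob_depth_at_least(k, mean_coverage):
    """P(depth >= k) at one base, assuming Poisson-distributed depth."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Under the Poisson assumption, chance that a base is observed
# at least three times at 8X versus 1X expected coverage.
print(prob_depth_at_least(3, 8.0))
print(prob_depth_at_least(3, 1.0))
```

With several independent observations of nearly every base, a lone disagreeing read can be outvoted. A transcript represented by one or two ESTs offers no such majority.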

There are no widely used, freely available assemblers for EST data, so we do the next best thing: use a genomic assembler such as phrap (P. Green) or CAP3 (X. Huang [1]). CAP3 is typically preferred for EST assembly (see [2] for a discussion), being less aggressive at splitting apart contigs.

In general, deciding whether to assemble two sequences together is easy as long as the observed differences between them are significant. When the observed difference between two sequences approaches the rate of sequencing error, determining whether the sequences represent two different genes becomes theoretically impossible without collecting more data. Since error rates across a collection of sequences form a distribution, there is a range of observed differences in which true differences and sequencing errors cannot be told apart, and assembly decisions within that range are arbitrary.
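
The ambiguous range can be illustrated with a simple binomial calculation. The sketch below assumes that errors are independent, that each read has the same per-base error rate, and that a column of the overlap disagrees through error alone with probability of roughly twice that rate. The error rate and overlap length are hypothetical values chosen for illustration, not measurements from SGN data.

```python
import math


def prob_at_least_k_differences(n, k, p_diff):
    """P(at least k of n overlap columns differ) when each column differs
    independently with probability p_diff (binomial upper tail)."""
    return sum(
        math.comb(n, i) * p_diff**i * (1.0 - p_diff) ** (n - i)
        for i in range(k, n + 1)
    )


overlap_length = 400       # hypothetical overlap, in bases
per_read_error = 0.01      # hypothetical per-base error rate of each read
p_diff = 2 * per_read_error  # approximation: an error in either read

# 8 differences is what error alone would produce on average here,
# so it says nothing about whether two genes are present.
print(prob_at_least_k_differences(overlap_length, 8, p_diff))

# 30 differences is very unlikely to arise from error alone.
print(prob_at_least_k_differences(overlap_length, 30, p_diff))
```

Between those two extremes lies the range where the tail probability is neither large nor negligible, and where a fixed cutoff will inevitably misclassify some pairs.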

The likely result is the over-representation or under-representation of gene families which contain recently diverged paralogs. Additionally, if the organism sequenced is heterozygous at many loci with significant allelic variation, similar results may occur.

This may be controlled by selection of threshold parameters governing the assembly process, but there is no "one size fits all" threshold that accurately decides all cases. Particular choices of thresholds may (a) promote false detection of distinct but similar genes, (b) promote false detection of alleles (by assembling close paralogs together), or (c) do both (a neutral choice of parameters). For SGN's assembly, we have decided to proceed with option (b), to attempt to minimize the number of false isolations of unique transcripts.
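
The effect of the threshold can be seen in a toy example. The sketch below groups sequences by single-linkage clustering on pairwise percent identity. This is a deliberately simplified stand-in and not how CAP3 or phrap decide overlaps. The sequence names and identity values are invented for illustration: est1 and est2 are imagined to come from one gene, est3 from a close paralog, and est4 from an unrelated gene.

```python
def cluster_by_identity(names, identities, threshold):
    """Single-linkage clustering: join any pair whose percent identity
    is at or above `threshold`. Returns a sorted list of sorted groups."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), ident in identities.items():
        if ident >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise percent identities over the overlapping region.
names = ["est1", "est2", "est3", "est4"]
identities = {
    ("est1", "est2"): 99.0,  # same gene, differing only by errors
    ("est1", "est3"): 96.0,  # close paralog
    ("est2", "est3"): 96.5,
    ("est1", "est4"): 70.0,  # unrelated
}

for threshold in (99.5, 98.0, 95.0):
    print(threshold, cluster_by_identity(names, identities, threshold))
```

In this toy data the strictest threshold splits est1 from est2 (case a), the loosest merges the paralog est3 into their group (case b), and only the middle value recovers the intended grouping. With real data no single value does so for every gene family.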

Future versions of SGN's unigene build process will include the option for the user to inspect an assembly's multiple sequence alignment (MSA) as well as view the major alternatives incorporated in any given assembly.


  1. Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877
  2. Liang, F. et al. (2000) An optimized protocol for analysis of EST sequences. Nucleic Acids Research, 28: 3657-3665