Meeting Report: Tomato Annotation Meeting
Ghent, Belgium October 23-25th, 2006
Yves Van de Peer (off and on)
Roeland van Ham
Maria Luisa Chiusano
Heiko Schoof (day 2)
The purpose of the meeting was to assess the quality of a previously generated gene-finder training data set, review the performance of already-trained, tomato-specific gene finders, define a distributed annotation pipeline for the tomato genome sequences currently being generated, and review the data submission procedures. Representatives from 9 countries involved in tomato sequencing and tomato annotation (through the EU-SOL project) attended the meeting.
Day 1 – October 23, 2006
First, a representative of every country present gave a brief overview of the sequencing progress.
India discussed how overlapping sequences were generated by sequencing seed BACs as far apart as 6 cM. The seed BACs therefore need to be carefully analyzed for potential overlaps, using the FPC fingerprint data; however, FPC data are not available for all BACs.
Mark Fiers and Erwin Datema reported on trials with 454 sequencing of full BACs. The feasibility of such an approach is not yet clear. They also gave a demonstration of Cyrille2, an interactive annotation pipeline system developed at their site in the Netherlands.
Maria Luisa Chiusano gave an overview of the EST alignment work and associated web resources that have been developed in her lab at the University of Naples.
Daniel Buchan gave a brief overview of the state of chromosome 4 sequencing. Most of the BACs should be available in early 2007.
Thomas, although not himself directly involved in the sequencing, presented progress on the sequencing for the French project and mentioned a technology called DAC, which allows many unfinished BACs to be finished in parallel; the technology is under active development. He also mentioned that enough resources may be available to also sequence the heterochromatic portion of chromosome 7. He then gave a brief overview of Eugene.
Francisco presented an overview of GeneID and a first tomato-specific matrix that was developed.
Next, Remy presented an analysis of the training set that had been manually generated by some members of the group. A total of 108 BACs were hand-annotated for complete and clean gene models. However, the resulting dataset was not homogeneous in quality, and some low-quality and/or incomplete gene models were retained. In the discussion that followed, Stéphane Rombauts explained that the poplar annotation project had used a very rigorous automated method to generate a training set, and he suggested that we try the same, letting the automated set supersede the hand-annotated one. General agreement was reached to try this course, with the automated generation performed by Stéphane using the same methods as for poplar. Stéphane also agreed to perform a trial run of his training set generator during the meeting for evaluation.
The cornerstone of the training set generation method is identifying annotations that are very well-supported by EST alignments, and that match at least 75% of their entire predicted protein to a known protein from Arabidopsis. With the number of sequenced BACs available, the trial run of his training set generator produced only 100 very confident gene models with the required level of EST support and Arabidopsis homology. The general conclusion was that this number was too low, but that the method was quite promising, and that final evaluation of the method should be deferred until more finished sequence is available.
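The two criteria above can be sketched as a simple filter. This is an illustrative sketch only, not the actual poplar scripts; the `GeneModel` fields and the threshold for "full EST support" are assumptions made for this example.

```python
# Illustrative training-set filter: keep only gene models that are fully
# supported by EST alignments and whose predicted protein matches a known
# Arabidopsis protein over at least 75% of its length.
# The data structure and the EST-coverage threshold are assumptions.
from dataclasses import dataclass

@dataclass
class GeneModel:
    name: str
    est_coverage: float        # fraction of the model covered by EST alignments
    arabidopsis_match: float   # fraction of predicted protein matched by an
                               # Arabidopsis protein

def is_training_candidate(m: GeneModel,
                          min_est_coverage: float = 1.0,
                          min_protein_match: float = 0.75) -> bool:
    return (m.est_coverage >= min_est_coverage
            and m.arabidopsis_match >= min_protein_match)

models = [
    GeneModel("AC12310_1", est_coverage=1.0, arabidopsis_match=0.82),
    GeneModel("AC12310_2", est_coverage=0.6, arabidopsis_match=0.90),
]
training_set = [m for m in models if is_training_candidate(m)]
# only AC12310_1 passes both criteria
```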
The next discussion focused on the submission process for BAC sequences and annotations. Currently, all project partners are supposed to submit to Genbank and SGN independently, which can lead to inconsistencies between the repositories when the two submission events are far separated in time. Daniel and Remy suggested submitting to Genbank only, from which SGN could pull the sequences to feed into the annotation pipelines. The problem with this approach, however, is that the actual assembly data would not be carried by SGN: Genbank accepts the full BAC sequence and the chromatograms for the individual sequence reads, but not the actual assembly data. A number of attendees asserted that this assembly data is valuable for the final assembly and should continue to be rigorously warehoused. After a long discussion, agreement was reached on the following protocol: First, the finished BAC sequence is submitted to Genbank, and a Genbank accession obtained. Then, the sequences, including the chromatograms and assembly information, are submitted to SGN, using essentially the same submission format as now, but with an additional file specifying the Genbank accession of the submission. SGN will determine the Genbank accessions of the currently submitted sequences and update them accordingly on the SGN FTP site. In addition, the following tags should be embedded in the comments field of each submission to Genbank: “ITAG” (for International Tomato Annotation Group) and “TOMGEN” (for Tomato Genome Sequencing Project). This will allow all BACs that were sequenced (TOMGEN) or annotated (ITAG) to be retrieved by searching Genbank for these keywords. A quick search of Genbank determined that these keywords are not presently in use by any other sequences.
In addition, Mark and Erwin at Wageningen will set up a central wiki site for use by the annotation project for documenting the stages and interchange formats required by the pipeline. (update: the wiki is up at http://www.ab.wur.nl/TomatoWiki )
A discussion on data formats concluded that for most things, GFF3 should be sufficient. GAME XML is richer, but is not as well-specified, and is the native format of Apollo. Artemis is also a very viable gene editor program, and it is capable of using GFF3 as a native save format. It was agreed that, for the present at least, both GFF3 and GAME XML formats will be used, since fairly well-developed conversion scripts exist at several of the sites involved.
Day 2 – October 24, 2006
The main focus of day 2 was establishing a high-level design of the ITAG annotation pipeline. An important aspect of the pipeline is that it is distributed, with many annotation centers participating in the process, each site doing what it does best. The pipeline is based on BAC sequences; whole pseudomolecule assemblies will also be run once they are available (in a format to be determined later by the ITAG and TOMGEN groups).
In summary, the complete pipeline is as follows:
1. BAC sequences are uploaded to Genbank, and a Genbank accession is obtained.
2. The BAC is uploaded to SGN.
3. SGN runs vector screens and contamination screens (chloroplast, mitochondrial and human sequences), and does other quality control, such as comparison of in vitro (from FPC data) vs in silico restriction fragment sizes. The actual submission to Genbank will also be quality-checked: sequences compared and the presence of the keywords (ITAG and TOMGEN) assured.
4. SGN runs RepeatMasker with tomato-derived and other repeat databases. This comes before the other pipeline steps so that some of them have the option of using the repeat-masked BAC sequence.
5. TBLASTX versus mimulus and potato sequences.
6. BLASTF (script from WUR) versus protein data sets:
   - arabidopsis, swissprot, solanaceae combined – SGN/Korea
   - other plants (rice, maize, medicago, poplar) – PSB
   - mitochondria (when available)
7. Transcript sequence alignments (CAB Napoli):
   - tomato – 98% identity, 90% coverage
   - solanaceae – 90% identity, 75% coverage
8. Ab-initio gene finders:
   - genscan - ?
   - genemark - ?
9. RFAM – blastn/infernal(?)
10. All predictions, alignments and BLASTs are downloaded by U. Ghent and fed into Eugene.
11. Proteins from the Eugene predictions are then functionally annotated with:
    - BLASTP vs. Arabidopsis and rice proteins, and against SwissProt
    - Interpro – Imperial
    - GO – MPIZ?
    - TargetP, SignalP, etc. – SGN
    - TmHMM – SGN
    - SGN Genes DB – SGN
12. SGN produces downloadable files and publishes them on FTP, including non-redundant protein sequences.
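The stage ordering above can be sketched as a simple dependency table. The stage names below are placeholders, not the official ITAG analysis tags (which are still to be assigned), and the exact dependencies are an illustrative reading of the steps listed above.

```python
# Hypothetical dependency table for the pipeline stages; stage names and
# dependencies are illustrative, not the official ITAG analysis tags.
PIPELINE_STAGES = {
    "repeatmasker":     [],                 # runs first, on the raw BAC
    "tblastx_mimulus":  ["repeatmasker"],   # may use the masked sequence
    "blast_proteins":   ["repeatmasker"],
    "transcript_align": ["repeatmasker"],
    "ab_initio":        ["repeatmasker"],
    "eugene":           ["tblastx_mimulus", "blast_proteins",
                         "transcript_align", "ab_initio"],
    "functional_annot": ["eugene"],         # BLASTP, Interpro, GO, etc.
}

def runnable(done: set) -> list:
    """Return the stages whose prerequisites are all complete."""
    return sorted(s for s, deps in PIPELINE_STAGES.items()
                  if s not in done and all(d in done for d in deps))

# With only RepeatMasker finished, the four evidence stages become runnable,
# while Eugene must wait for all of them.
print(runnable({"repeatmasker"}))
```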
Following the establishment of the pipeline steps, a discussion began on data flow between the stages. Early on, it was agreed that an implementation using a central server as a pipeline coordinator would be simpler and more robust. The bulk of the discussion was devoted to whether this central server would call on each remote pipeline stage to perform the analysis as soon as a sequence is available (a “push” model), or whether the central server would make the data available and wait for each analysis to retrieve its input and upload its output (a “pull” model). The “push” model has the advantage of allowing more rigorous flow control, since the central server has more knowledge of the running status of each analysis, but requires more from the remote servers, such as availability for external connections and the capability to run the analyses in a highly automated way. The “pull” model does not require external availability or complete automation from the remote pipeline stages, since they only have to download their input from and upload their output to the central pipeline server. Flow control in the pull model would be by means of pipeline status information made available by the central pipeline server, tracking what analysis results are available, and for each analysis, whether its required inputs are ready for download.
Since the “pull” model places less of a burden on each remote pipeline stage, it was decided that (like the medicago annotation project), the tomato distributed annotation pipeline would be pull-driven. To simplify administration, it was also decided that the pipeline should be run on batches of BACs, rather than individual BACs.
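A pull-driven stage might poll the central status information roughly as sketched below. The machine-readable status format has not been specified yet; the JSON layout, field names, and batch IDs here are invented purely for illustration.

```python
# Hypothetical "pull" client sketch. A remote stage periodically fetches
# pipeline status from the central server and picks out the batches whose
# required inputs are ready for download. The status format is invented.
import json

def ready_batches(status_json: str, analysis: str) -> list:
    """Return the IDs of batches whose inputs for `analysis` are ready.

    Assumed status layout:
      {"batches": [{"id": "batch003",
                    "inputs_ready": {"eugene": true, ...}}, ...]}
    """
    status = json.loads(status_json)
    return [b["id"] for b in status["batches"]
            if b.get("inputs_ready", {}).get(analysis, False)]

status = json.dumps({"batches": [
    {"id": "batch003", "inputs_ready": {"eugene": True}},
    {"id": "batch004", "inputs_ready": {"eugene": False}},
]})
# Only batch003 has its Eugene inputs ready; the stage would then download
# those input files (e.g. over sftp) and later upload its results.
print(ready_batches(status, "eugene"))
```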
Next, a discussion began on the structure and location of the central annotation result repository. It was decided that SGN would house the central repository, and transfer to and from the repository would be accomplished either with scp or sftp running over an encrypted ssh2 channel. An encrypted transfer scheme was preferred over non-encrypted FTP because it offers more secure and flexible authentication mechanisms, greater assurance of data integrity, and acceptable transfer bandwidth requirements. The repository will be configured such that all ITAG participants have accounts and can upload, download, and if necessary delete files from their assigned parts of the repository.
Next, the discussion turned to naming conventions. The general conclusion was that BACs in the annotation pipeline should be referenced by their unversioned Genbank accession, which is less ambiguous than their well plate, row, and column designations, since wells can be contaminated with other BAC sequences. Using the unversioned Genbank accession also keeps locus names more stable when the underlying BAC sequence changes. File names and locus names should likewise be based on these Genbank accessions; the accession also has the advantage of tending to be shorter than the clone name. Annotation pipeline gene identifiers thus start with the Genbank accession, followed by an underscore and a numeric index number, unique on that BAC. For alternative splicing, splice variants are denoted with a parenthesized letter following the numeric index number. This can be followed by a dot and a version number to denote slightly differently annotated versions of the same locus; version numbers are increased if the underlying BAC sequence changes. For example, for the third locus annotated on a fictional BAC AC12310, in its second alternative transcript and first version, the identifier would be “AC12310_3(b).1”. This scheme is similar to the one used in the Medicago annotation.
The numeric index does not specify a position on the BAC, but reflects the order in which the gene models were created. When a new locus is annotated, a new numeric index is chosen for it that is one greater than the previous highest index number. If a gene model is created by merging two older gene models, the two old gene model identifiers are retired from use and a new identifier is generated for the merged gene model. For example, if AC12310_7.1 is merged with AC12310_11.1, the resulting locus might be named AC12310_42.1.
Thus, adjacent gene models on the genome will not necessarily have numerically adjacent identifiers, depending on the order in which loci have been added, removed, merged, and so forth since the initial assignment of locus names.
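The identifier scheme just described can be captured in a small parser. This is a sketch against the convention as agreed at the meeting; the exact accession pattern (two letters followed by digits) is an assumption based on the examples used.

```python
# Sketch of a parser for the agreed locus identifier scheme,
# e.g. "AC12310_3(b).1": unversioned Genbank accession, underscore,
# per-BAC numeric index, optional parenthesized splice-variant letter,
# optional dotted annotation version. The accession pattern is assumed.
import re

LOCUS_RE = re.compile(
    r"^(?P<accession>[A-Z]{2}\d+)"   # unversioned Genbank accession
    r"_(?P<index>\d+)"               # numeric index, unique on the BAC
    r"(?:\((?P<variant>[a-z])\))?"   # optional splice-variant letter
    r"(?:\.(?P<version>\d+))?$"      # optional annotation version
)

def parse_locus(name: str) -> dict:
    m = LOCUS_RE.match(name)
    if m is None:
        raise ValueError("not a valid locus name: %r" % name)
    return m.groupdict()

parts = parse_locus("AC12310_3(b).1")
# parts == {'accession': 'AC12310', 'index': '3', 'variant': 'b', 'version': '1'}
```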
A predictable file naming scheme is critical for a pull-based pipeline mechanism. The following file naming convention for pipeline result files was formulated and agreed upon:
<versioned acc.>.<analysis>.itag<pipeline ver.>.v<file ver.>.<file type>
For example, “AC12310.1.repeatmasker_TIGRRepbase.itag12.v3.gff” would be the third version of the file containing the results of running the analysis 'repeatmasker_TIGRRepbase' on the BAC sequence AC12310.1, as part of version 12 of the ITAG pipeline.
The analysis tags (e.g. 'repeatmasker_TIGRRepbase' or 'eugene') will be determined and assigned by ITAG in the coming weeks.
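Building a result file name from its parts is then mechanical. The analysis tag below is taken from the example above; any other values are placeholders until ITAG assigns the official tags.

```python
# Sketch of building pipeline result file names following the convention
# <versioned acc.>.<analysis>.itag<pipeline ver.>.v<file ver.>.<file type>.
def result_filename(versioned_acc, analysis,
                    pipeline_version, file_version, file_type="gff"):
    return "%s.%s.itag%d.v%d.%s" % (
        versioned_acc, analysis, pipeline_version, file_version, file_type)

name = result_filename("AC12310.1", "repeatmasker_TIGRRepbase", 12, 3)
# name == "AC12310.1.repeatmasker_TIGRRepbase.itag12.v3.gff"
```

Note that parsing such a name back into its parts should split from the right, since the versioned accession itself contains a dot.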
The ITAG pipeline version is a particularly important part of the file name. Since many analyses in the pipeline depend on the output of other analyses, any change in the methods used at any step (such as updating reference databases or changing output formats) will usually require re-running of some or all of the analyses in the pipeline to ensure that all analysis results remain directly comparable and consistent with each other. Therefore, it will be essential to make these changes in a controlled and coordinated manner. It was agreed that each static snapshot of the analyses and reference datasets used in the pipeline will be given a pipeline version number, starting from 0 and incrementing by 1 each time any change is made to the pipeline that may affect any analysis's output. Pipeline version increments must be agreed upon beforehand, and will not be allowed while an annotation batch is in progress. It was also agreed that pipeline version 0 should be a special development version: while the pipeline is at version 0, developers are free to change and/or update their pipeline stages without a version increment. When the pipeline is considered to be working and producing good results, the pipeline version will be incremented to 1 and rigorous pipeline version control will begin.
How often should the pipeline be run? It was felt that running the pipeline on single BACs would be a waste of time, so a minimum batch size of 10 should be set. In addition, to avoid putting too much of a computational burden on our sites, we also agreed on an initial maximum batch size of 100 BACs. However, these limits should be revisited once the pipeline is running and its performance characteristics are better established.
Final gene annotations will be published primarily in the form of several fasta-format files.
Fasta files will use the following format for the description lines:
><locus name> <functional description> <versioned seq. acc.> <evidence codes> <location on seq> <timestamp>
Locus name: properly formatted locus name as set out above
Functional description: a draft functional description of the locus (obtained from functional analysis stages of the pipeline)
Versioned sequence accession: the versioned Genbank accession of the BAC sequence (e.g. AC12312.1)
Evidence codes: string encoding the evidence supporting this annotated locus, composed of one or more of the following letters:
F - Full length cDNA aligned
E - EST coverage
H - homology to an annotation in another sequenced species
I - ab initio prediction
Location on sequence: 1-based nucleotide coordinate range on the BAC sequence, formatted as <start>-<finish>, e.g. 41223-48128
Timestamp: the date and time at which the annotation record was generated
Therefore, an example of a properly-formatted description line would be:
>AC21353_4(a).2 putative x-ray vision protein AC21353.1 FEHI 12931-18446 2006-10-31/14:36:22
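The description-line fields can be assembled with a simple formatter; the values below reproduce the example line above.

```python
# Sketch of formatting the agreed fasta description line:
# ><locus> <description> <versioned acc.> <evidence> <start>-<finish> <timestamp>
def format_defline(locus, description, versioned_acc,
                   evidence, start, finish, timestamp):
    return ">%s %s %s %s %d-%d %s" % (
        locus, description, versioned_acc, evidence, start, finish, timestamp)

line = format_defline("AC21353_4(a).2", "putative x-ray vision protein",
                      "AC21353.1", "FEHI", 12931, 18446,
                      "2006-10-31/14:36:22")
# line reproduces the example description line from the notes
```

Since the functional description itself contains spaces, any parser of these lines would need to treat the description as everything between the locus name and the trailing four fixed fields, rather than splitting naively on whitespace.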
The annotation of pseudo genes will be worked out at a later date.
The format for the pseudomolecules to be used will be worked out at a later date.
Day 3 – October 25, 2006
This was a half-day meeting, mostly devoted to clarifications and additions to the decisions made in the preceding two days. Minimum and maximum BAC batch sizes were discussed again briefly, and the initial minimum and maximum of 10 and 100 BACs, respectively, were confirmed.
Additionally, a request by Lincoln Stein for permission to do a genome-wide annotation using the Ensembl annotation pipeline was discussed. The decision was made not to grant permission for him to publish an annotation at this time, since his analysis pipeline would not be specifically tailored to tomato, leading to a lower-quality annotation, and it would lead to confusion about which genome annotation is the “official” one.
Also, there was a discussion of the need for a note to be attached to our BAC sequences in Genbank, asking that people defer genome-wide analyses until our official annotation comes out. A consensus was reached that the text of this note should be discussed and agreed upon at the upcoming SOL project meeting in November.
Next, some clarifications to the pipeline versioning scheme were made. The idea of a free-development pipeline version 0 was introduced (already covered above). The mechanics of pipeline synchronization were briefly discussed, with Rob clarifying that SGN intended to provide both a human-readable web page showing pipeline status and a machine-readable pipeline status web service, as described above.
Next came a discussion of arrangements for further tomato annotation meetings. An agreement was reached to hold a tomato annotation meeting at PAG in San Diego in January. Also, an agreement was made to try to have a phone conference of tomato annotators every two weeks. Stéphane introduced the VRVS service (http://www.vrvs.org), a non-commercial internet conferencing service, as a possible mechanism for doing this without the cost of international phone calls.