SOL bioinformatics meeting
PAG Conference
San Diego, CA

Lukas Mueller (US)
Beth Swakrlkjvdengi (US)
Nicholas Taylor (US)
John Binns (US)
Robert Buels (US)
Eileen Wang (US)
Linhai Zhang (US?)
Doil Choi (KR)
Cheol-Goo Hur (KR)
David De Koeyer (CA)
Heiko Schoof (DE)
Farid Regad (FR)
Hans de Jong (NL)
Rene Klein-Lankhorst (NL)
Satoshi Tabata (JP)
Roeland van Ham
Giorgio Valle (IT)
Maria Luisa Chiusano (IT)
Reinhard Simon (PE)
Arun Sharma (IN)
Daisuke Shibata (JP)
Sean Humphray (UK)
Klaus Meyer (DE)
Stephane Rombauts (BE)
Mathilde Causse (FR)
Maria Raffaella Ercolano (IT)
Sue Rhee (US)


Lukas Mueller, presenting for the US

Lukas gives a summary of the meeting contents.

- Summary of Wageningen
- Reports from chromosome sequencing projects

- it was agreed at the meeting to define data formats, method standards, and quality standards
- agreed that all data should be attributed to data generators
- all centers will be involved in the annotation
      - this means that the annotation pipeline will be distributed, which will be discussed a bit later

------------------------- PROGRESS SUMMARIES ------------------------------
Lukas Mueller
Chromosomes 1, 10, 11

- agreed that the actual submission to GenBank should be the unaltered sequence of the BAC clone
   *  comment (Heiko Schoof): GenBank people have tools to allow submitting updates to GenBank entries.  Care must be taken in keeping track of ownership of each BAC clone record there.  Updates can be submitted to the entry without changing the original GenBank entry.

- countries have priority on publishing about 'their' chromosome
- full genome will be published by the consortium

- SGN is the central data repository, other locations?
  * there is a lot of work involved in this

comment (Heiko Schoof):  Integrating the data together was agreed to be done on a case-by-case basis.
- US has chromosomes 1, 10, and 11
- sheared library will be used as an unbiased data set to analyze the structure

****action item (Rob): make sure blast dataset is on the ftp site and in the blast databases directory

Current Status of Chromosome 2 Sequencing
Doil Choi, presenting

- overgo, PCR


- 60 markers, combined they hit about 520 BACs

- 2 BACs have been finished sequencing

Christine Nicholson
Sanger Institute, UK

Group met for first time in December
Sanger will be doing the mapping, sequencing, and finishing for the BACs for chr. 4.

Will start by selecting small number of BACs looking like they overlap.
Will make all data public as soon as the sequence is produced.
Will be finishing all BACs to the Sanger finishing standard (1 error in 10^7).
Will be using the BAC-end sequences in tile-path selection.
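The Sanger finishing standard quoted above (1 error in 10^7 bases) maps directly onto the familiar Phred quality scale; a quick sketch of the conversion (Q = -10 log10(p)):

```python
import math

def phred(p_error):
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p_error)

# The finishing standard of 1 error in 10^7 bases corresponds to Phred Q70.
print(phred(1e-7))
```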

Lukas to Doil: What do you think is the reason you are having trouble getting signals?
Doil: Some of the BACs in the centromeric region are cross-hybridized.
Eileen: What does 'no signal' mean?
Doil: No signal from the overgo probing.  FISH always gets signals, but sometimes in the heterochromatin, sometimes on other chromosomes.

Hans de Jong: did you apply Cot-100 DNA to suppress the effect of repetitive sequences?
Doil: No, just labeled the BAC and used for hybridization.

question: Will Sanger be doing its own fingerprinting or using the Arizona fingerprinting?
Christine: Sanger will use the Arizona fingerprinting and merge electronically

Arun Sharma
Chromosome 5

Project not yet started, expecting funding within the month.
3 places in India:
  Delhi University, South Campus
  National Research Centre on Plant Biotechnology
  National Centre for Plant Genome Research

Has completed a server hardware setup for bioinformatics

Ready for sequencing of BACs, since much of infrastructure is already there
Lukas: sequencing actually happening at the university?
Arun: yes.

Chromosome 3
Eileen Wang (SGN), speaking on their behalf

Are revising fingerprinting for Chromosome 3
Are going to use the BAC-end information
Have done FISH on >30 BACs
Chose 20 BACs for sequencing
Should have Phase 1 sequencing results for those by the end of this month

Chromosome 6
Rene Klein-Lankhorst

using markers from the Keygene AFLP map, currently integrated with the F2.2000 map from Cornell
using Keygene contig matching & BAC pool fishing strategies, confirmed with FISH
first focus on short arm of chr. 6; only 2 BACs where expected. Very important to check all BACs with FISH.
Of 4 BACs from Cornell, only 1 is in the 6S region when confirmed with FISH
Hans de Jong: 80% of short arm is heterochromatin, maybe this is why 6S is problematic?

41 mapped BACs
31 selected for sequencing
14 finished
9 in sequencing pipeline
	19 contigs per 100kb (expect 10 at 8x)
	76kb average assembled contig length


Long arm physical mapping results look much better than the short arm.

Chromosome 7
Farid Regad

waiting for phase 1 sequencing funding (500K Euros), but a private contractor already has sequencing underway
Genopole Toulouse will be doing all of the bioinformatics

Satoshi Tabata
Chromosome 8

Primarily supported by local Chiba government, not the federal government

17Mb of euchromatin
using 97 markers as seed points
~175 clones to sequence

Lukas: suggests using alternative methods to confirm BACs, not FPC
Satoshi: does not seem realistic to do FISH for each clone
Eileen: suggested steps are 
	1.) reconfirm BAC-marker association
	2.) reconfirm physical location on each chromosome
Lukas: we will talk about another method for doing that involving PCR

Chromosome 9
(not present)
Lukas: Spain is currently in review for funding

(Chromosomes 10 and 11 are being done by the US)

Maria Luisa Chiusano
Chromosome 12

11Mb of euchromatin
projected 113 BACs

sequencing has begun, 3 BACs currently in the sequencing pipeline

mapping confirmation for the other two seed BACs is in progress

Bioinformatics pipeline is under construction

	       Structural characterization
	       annotation by similarity
	       annotation by prediction
	       conflict resolution and manual curation
	       project-specific analysis
				Comparative genomics and phylogenetics
				Identification and characterization of gene....


Eileen: Tomato repetitive sequences (TGR1, TGR2, TGR3) - Chinese group is doing FISH for those
	In process of identifying the BAC addresses of each of those, so we'll know which BACs
	are probably in the centromeric region.  

-------------------------- COFFEE BREAK ---------------------------

Eileen Wang
Seed BAC Selection

-anchor the BAC contigs on the highly-saturated genetic map (F2.2000)

2 parts of the overgo project
  Tomato BAC library - 88,000 clones in the library have been fingerprinted
  Tomato Genetic Markers

Verifying BAC <-> Marker Associations

-select 2 clones (when possible) per marker for sequence verification using the following parameters
	1st choice - insert size > 100Kb, only one or two clones for that marker
	2nd choice - insert size > 60Kb, insert size unknown, from plates #1-#260
	3rd choice - insert size unknown, from plates > #260, insert size less than 60Kb

NOTE: when communicating regarding clone IDs, please use only the LE_HBa* clone IDs
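Since clone IDs come up repeatedly, a small sketch of what machine-checking them might look like (the layout assumed here is hypothetical: "LE_HBa" library prefix, 4-digit plate, row letter, 2-digit column, e.g. "LE_HBa0123A12"; adjust the pattern if the real convention differs):

```python
import re

# Hypothetical LE_HBa clone ID layout: prefix + 4-digit plate + row + column.
CLONE_ID = re.compile(r"^(?P<lib>LE_HBa)(?P<plate>\d{4})(?P<row>[A-P])(?P<col>\d{2})$")

def parse_clone_id(clone_id):
    """Split a clone ID into library, plate, row, and column, or return None."""
    m = CLONE_ID.match(clone_id)
    if m is None:
        return None
    return {
        "library": m.group("lib"),
        "plate": int(m.group("plate")),
        "row": m.group("row"),
        "column": int(m.group("col")),
    }
```

A validator like this could reject non-LE_HBa identifiers at data-entry time rather than after the fact.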

Criteria for selecting a seed BAC:
1) large insert size (>60Kb if possible, or unknown)
2) BAC-marker association is reconfirmed by sequencing, overgo hybridization or PCR amplification
3) BAC physical locations are tested

------------------  SGN BAC REGISTRY DEMO ----------------------
Lukas Mueller (US)

Lukas:      add 'Comment' field
H. Schoof:  add fields for FISH, location, graph, (other things)

Doil:	    We need to agree on a method and format for transferring BAC assembly data.

Lukas:	    Most programs create a folder containing seq_dir, chromat_dir, etc, let's just make those folders the basic unit of data transfer for BAC assemblies.

G. Valle:   The data should be mirrored at multiple sites.

Lukas:	    We also need a standard data format for annotation information.  I advocate game XML for this.

Rene K-L:   I would like to have a feature in the BAC registry where you can submit a BAC name and get an alert if another person has the BAC attributed to them.

Farid Regad (FR):  What backup systems are in place with the data on this BAC registry?  What if you make a mistake during entry?

Lukas:	    Make sure the system is keeping track of _who_ changes the status of a BAC.

Schoof:	    Will a batch interface for this be needed?  

--? :	    Start simple, and we can add functionality (eg batch update) as needed.

Lukas:	    Anything else?


G. Valle:   I have made a little calculation, and I am wondering, would it be advantageous to do another 100,000 BAC-end sequences?  I calculate that it would be useful to do up to 500,000 BAC-end sequences for finding the minimum tiling path.  I think, though, that 200,000 BAC-end sequences should be enough.
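As a rough sanity check on those numbers: the average spacing between BAC-end anchor points is just genome size divided by the number of end sequences (a sketch; the ~950 Mb tomato genome size is an assumed illustrative value, not a figure from the meeting):

```python
GENOME_BP = 950_000_000  # assumed approximate tomato genome size

def mean_spacing_bp(n_end_sequences, genome_bp=GENOME_BP):
    """Average distance in bp between consecutive BAC-end anchor points."""
    return genome_bp / n_end_sequences

for n in (100_000, 200_000, 500_000):
    print(f"{n:>7} ends -> one anchor every ~{mean_spacing_bp(n) / 1000:.1f} kb")
```

Denser anchoring (smaller spacing) gives more choices when picking minimum-tiling-path clones, which is the trade-off being weighed above.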

Lukas:	    That would be great, but we only have 120,000 clones in the HindIII lib, and 50,000 more BACs in the (other one)

Lukas:	    The BAC-end sequencing should be done during the first quarter of 2005.

Doil:	    So you say the BAC-end sequence data should be available by April?

Lukas:	    Yes, that is if the EcoRI library turns out to be as easy as _____ says it should be.


C. Nicholson: I'd like to bring up access to the SGN repetitive sequence blast dataset.

Lukas:	      That will be made available on the FTP site.  

Cheol-Goo Hur (KR)

Suggests establishment of a Solanaceae Gene Catalog containing fields for
Function catalog
Gene ontology
Protein Family
Comparative genomics
Stress-Related gene?


Lukas:  the model agreed upon before Wageningen was that each country would supply annotation for 'their' BAC, but at Wageningen the agreement that came out was that the annotation should be more distributed.  Everyone wants to have some control over the annotation.  Many centers have a lot of experience in annotation, so the idea was for every bioinformatics center to make available some sort of interface to their annotation pipeline.
The objective in the end is to have uniform, high-quality annotation available in a timely manner.  This suggests that we should go to the experts.  E.g., if somebody is very good at microarray annotation, they should do all of it.

So the idea was at Wageningen that everyone would make available a frontend for their annotation pipeline that other members of the SOL project would use.  (Put in a sequence and get out an annotation object like game XML or the like).

This sounds simple, but it's more complicated in practice.  A lot of calculation is involved in generating an annotation object.  For example, MOBY services don't yet fully support very long generation times for returned objects.
H. Schoof: That feature is coming.

--?: Is there an agreed-upon standard for what constitutes a high-quality annotation?

L: We have a standards document that spells out recommendations for this, but they are not the final version yet.  We have a whole list of nomenclature in that document, e.g. if it has a high homology it is a 'known gene'....these terms are listed.  Right now, they live in that document and nobody seems to be aware of it.  

--?: Is there an agreement on how often a BAC is going to be annotated?
L: We sort of thought every BAC would be annotated as it comes back from the seq center, then again by the annotation pipeline, then again at the genomic level by a whole-genome annotation browser.

1.) everybody does their own annotation
2.) annotation through web services
3.) annotation on a whole-genome basis

H. Schoof:  We need at least some centralization to allow looking up annotation in a definitive and centralized way.  The paradigm used by medicago (...???)

G. Valle:  You were talking about using XML for annotation.  Is there already an XML standard that we can use, or should we form a committee to draft one?

Lukas:	   There are a lot of them out there, and we shouldn't reinvent one.  My favorite is game XML.  I would recommend something like that be 

L: what is medicago using?
Schoof:  both game XML and TIGR XML.

Lukas: TIGR XML provides tiling info that game XML doesn't.  It is also more expressive for how annotations are defined, etc.  But in game XML, you can provide the actual alignments that you use to derive the annotation.  In the near term at least, I think game XML is the best.

Schoof:  Isn't that already in the whitepaper?
Lukas: yes.

L: everyone should read it and give me feedback.  

Schoof: the important thing about the whitepaper is that it need not be something static; it is more useful to shout out that we should put in something different than to just be silent and do it differently.

L: definitely.  it should be a community-driven thing, and we'll throw out what we're not using.


--?:  I think guidelines are great, but we need to have a very short (!) very minimal set of standards that we can all agree upon.  E.g. how to call, what's unknown, what's similar to X.  We need a real agreement and not loose guidelines.

Schoof:  A lot of that is also vital as documentation as to how the published work was done.

G. Valle: What about gene ontology?

L: This is now going into function annotation.  We will have automatically-mapped gene ontology annotation.(???)

--?:  Doing it on the functional ontology is the most you could do computationally.

Valle:  They are all linked.  In the gene ontology you have to declare the evidence.

--?: In addition to TargetP, what do you think about doing high-matching sequences: just blastp against Arabidopsis and using the high-matching sequences?

L: Problem is, that gives you  a very biased set.

Schoof:  Different centers are going to use different methods to assign GO terms.  Is this stuff ever going to be done centrally?

--?:  So you're suggesting that the individual centers don't do it?

Schoof:  No, they should do it.

--?:  that will just create confusion.

Eileen: We should postpone this until we have more sequencing and expression results.

L: we should focus on the structural annotation, the gene locations.  We should have that settled at the Isle meeting.

Schoof: (to Hur)  I like this idea of a gene catalog you put up.  There is not that much use in standardizing all that; it gives more diversity of ideas.  I'll collect the GO assignments generated all over the world by many different methods, then the user can comb through those using various criteria.

--?: Problem is, even if you filter like that, it won't be useful and will create confusion.  There has to be a way to assign a highly specific type of evidence.  If we go with the distributed, uncoordinated method it will cause a lot of confusion.  I agree that it doesn't have to be resolved now.

L:  We should definitely discuss it at the Isle meeting.

Eileen:  After this meeting we can circulate the guidelines you have, and we can have a guideline for the structural annotation.

L:  If we really want a distributed system, we need to have some (MOBY) servers up and running that we can actually check out and test.  So an 

ACTION ITEM: before the Isle meeting, have annotation servers online that people can test out, and we can come up with quality assignments for the different pipelines
1) write parsers for your tool to spit out game XML

Schoof:   additionally, you should require that the results be in the public domain, so that they can be used.

4?:  Is the performance of the pipelines an issue?  One might like to recalculate everything (the whole genome), you might have to have some very powerful machinery behind that.

L:   I guess it puts the pressure on the server operators to provide those kinds of machines.

Nick:  Remember that compared to sequencing, computer power is pretty cheap.

L:  People often have trouble scaling their pipeline code, remember.

Schoof:  That will have to be a consideration in our evaluation of the different pipelines.

--?:  During the break there was a discussion on naming conventions for contigs, clones, etc.  

L: That's important, and I think the conventions for the read names were covered in the whitepaper, but I removed it because the sequencing centers couldn't do it.  For the clone names, it's relatively easy.  Right now, we have different types, but I'd like to standardize on the Cornell name.

Eileen:  No, we should use the AZ name because it contains the library ID.  The cornell name doesn't.  

L: But they are so unwieldy and inconsistent....

Eileen:  No, the AZ name is consistent.

--?: Cornell, just make a decision and make everybody else use it.  For the other entities that will be coming from the seq centers, do we have a desire for standard names?

Eileen: Very useful.

--?: So maybe we can come up with a nomenclature and send it to all of the sequencing centers.  

L:  So chromatograms, and what else....leading zeros can create a lot of problems.

Doil:  Just think about it, decide, and inform all of us.

L:  Okay, we can do that.

L: So if we have this unified pipeline, what will we base it on?  Does anybody have recommendations or comments?

L: I guess it's a question for Heiko.

Schoof: In the medicago project, we just chose whatever tools worked for what steps.  E.g. TIGR has a good pipeline for....
How data travels between those steps, is just a matter of efficiency.  For now, FTP seems to be the most common, because it's simple and convenient.  For more dynamic, interactive things, we're using MOBY.  Like doing lookups to another database, or getting annotations for a web display....we're using _____  as well as MOBY.

L: you mentioned that MOBY will be extended?

Schoof: you can run MOBY over FTP, also.  It just depends on whether the task warrants building an automated system for it.  Like in Sanger's case, they don't automate it.  Other times, it might be worth it to automate completely.

L: It seems that the data formats are the most important.

Schoof: The only more important thing is documentation.  Even the most obscure data format can be used if documented.

Schoof: That reminds me of versioning.  Especially in a distributed system, versioning is very important so people will all be working with the same version of the data.

G. Valle:  How do you define that?

Schoof: For medicago, we have a standards document for that.  E.g. for every gene call you have to have the GenBank accession, the tool, the date, etc.

L: So does that solve all the problems?

--------- GENE PREDICTION STUFF -------

Rombauts: The software must be adapted to the Solanaceae.  There needs to be a training (test) set available for everyone to develop their pipeline on.  Without this, it's easy to set up a system that will break if you don't have good test data.  We could do it in 2 steps, where you use pepper (or whatever data) at the outset, then specialize more as tomato data comes out.  Those that provide an annotation server must be sure that it is specific to the Solanaceae in its annotation methods.

Meyer: How big a data set would you need? 

Rombauts: If you want to train it, you have to have enough data to give the software a picture of what a whole gene looks like, at least 1000 genes.  

Rombauts: The sooner the people sequencing BACs share them, the sooner the quality software will be there.


Schoof:  Non-specific gene predictors are going to be used no matter what.  As long as we have transparency in the way they are used, we can compensate at some point.

Rombauts:  Just be careful and take all of your data from these with a grain of salt.

L: It just has to be clear to all concerned (funding agencies, users, etc) that the data is going to be a little inaccurate at the outset. 

Eileen:  Maybe we can set up a timeline for the goodness of the dataset, with certain date-based landmarks --- second-stage annotation, third-stage, etc.  Heiko (Schoof) is right.  The whole community has to be aware of the shortcomings of the gene finding programs.

Farid Regad:  maybe each country can contribute a fully-annotated BAC or two, as a spot-check on the predicted genes.

Eileen: Good idea.

Valle:  That is such a huge amount of work!

L: I think that should only be done with the high-quality annotation.  You would have a lot of work that would not be confirmed.

Schoof:  Well, while you can use EST assemblies....the experience usually is that if you have coverage on both the 5' and 3' ends you can use it pretty well in a training set.  

Schoof:  So at what stage do we want to put the annotation in big databases like GenBank?  While it's still preliminary, or not until it's high-quality?

L: I would strongly prefer that we wait, because even with the update tools...

--?:  I think this is a VERY big decision that needs to have discussion among all the PIs.

Schoof:  If we decide to put it early into GenBank, we have to make it clear that it's preliminary.  There are different levels of compromise.

Rombauts:  With the sequences, you can have something like stage1, stage2, stage3, that enumerates how firm the annotation is.

Schoof:  That comes back to how we do medicago, where every gene call has to have an associated category like that.

Meyer:  How do the vertebrate people do it?

all: we should ask them.

--?: Users want everything there and the ability to filter themselves.

Chiusano:  If you do large-scale analysis, you can have trouble if you don't have good predictions.  It is a good idea to annotate this data as to how good it is.

L:  Are there any other points?

Schoof:  Do we have a decision on this annotation submission?

L:  No.  I'm worried that crappy annotations will live forever.  But if we can update them, I guess we should submit them, but there is a danger that they won't get updated.

Rombauts:  Ideal would be that every BAC is submitted under the consortium so every member gets updates, and we should do the phase1,2,3,etc.

--?:  We can just put it in the definition line.

L:  There are going to be some problem BACs that will never be finished and might live forever as low-quality.  But they might cover a lot of area and they still need to be public.  In general terms, we should make available all the sequence that we have.

Schoof:  So we have to write a letter to EMBL and/or GenBank to make a communal consortium account to submit BACs.

L:  Let's discuss this on sol-steering.

Rombauts:  The phases of annotation must be described.  Not just on a gene-by-gene basis, but on an overall BAC-by-BAC basis, describing the phase of the pipeline that was used for gene prediction on that BAC.

Valle: We have to distinguish between the prediction and the sequencing.

Sharma:  Can there be scoring on the pipelines, so there is a continuum of how good it can be considered to be?

Schoof:  We can't right now.

6?: Maybe we can have two scores, one for the pipeline, one for the quality of the data that was fed to it.

Humphray:  That's similar to what the human genome people are doing.  In Vega, it's all manual; in Ensembl, it's all automatic.  Your confidence is adjusted for the system you're using.

Schoof:  In medicago, we give the identifier of the tool used, and it's up to the medicago tool developers(?) to document what that actually means.

van Ham: Seems like most vertebrate genome projects now are whole-genome shotgun, so you naturally have more centralization.

L: So shall we go around for final comments and then adjourn?

C. Nicholson: ...

L: I'll make a mailing list for tomato-cap-something.

L: Other comments?

L: So what are the dinner plans?


L: everybody send their notes to me.  And their presentations.