Report of the E. coli Model Organism Informatics Group

Request for NIH support of E. coli as a Model Organism

I. Background

  1. Challenges related to involvement of a broad community
  2. Current E. coli databases and other MODs

II. Proposed New Initiative

Appendices

  A. Comparison of SGD and current E. coli databases
  B. Currently available databases
  C. Process of soliciting interest in this proposal: querying and creating the E. coli community
  D. Members of the E. coli Informatics Workshop

Background: This White Paper was prepared by a subcommittee of a group that met under NIGMS auspices in March 2003, chaired by Susan Gottesman of NCI. The group included informatics experts and experimentalists. The NIGMS Advisory Council, at its September 2004 meeting, considered the recommendations of this Paper and approved an initiative to provide support for the development, continuous upgrading, curation, and maintenance of an integrated data resource for Escherichia coli K-12.

Request for NIH support of E. coli as a Model Organism

E. coli is the organism most studied, the source for much of our information on molecular biology, metabolic pathways, regulation, and biochemistry, and continues to be a source for new insights into how cells work. Because of its central role in both our historical and ongoing understanding of basic biological processes, there is a pressing need to bring together the large amount of historical and rapidly accumulating genome-wide current data and to make this information usable by the larger biomedical community.

Despite its obvious importance, E. coli is currently not even designated by NIH as a "Model Organism." It is essential that E. coli be so designated, and that resources be allocated to support the development, continuous upgrading, curation, and maintenance of a Model Organism Database (MOD) as an integrated data resource for E. coli that is at least as functional as those available for other designated Model Organisms. Moreover, we argue that the challenges involved in integrating E. coli data in a usable form also constitute an opportunity to develop data integration paradigms that will have broader applications for other databases. This document outlines the priorities and needs for database integration to address these problems.

This proposal emphasizes the following major points:

Background

Goals - what is needed and what should be required of any federally funded database effort(s).

  I. Background

    1. Challenges related to involvement of a broad community

      E. coli is the reference organism for all work in bacterial systems and for much work on proteins and metabolic pathways in eukaryotic systems. Integrating all the available information on the biology of any organism is a daunting task. For E. coli, PubMed lists over 65,000 papers with E. coli as a major subject heading [1] from the period before the publication of the first complete E. coli genome sequence in 1997. Now that the sequences of two E. coli K-12 isolates and of a growing number of pathogenic E. coli strains and related species are available, massive amounts of genome-wide annotations, experimental data, and large-scale comparative analyses with other organisms are rapidly accumulating. All of this information becomes useless if it is not accessible to the community of users, regardless of whether this is due to actual unavailability, to being buried in information overload, or to being available but not usable. Usability involves not just having archives of data, but also having tools to integrate different studies and to integrate modern high-throughput studies with decades of more traditional studies on E. coli and other organisms.

      The specific community of workers who study E. coli and its products directly is large, even if it does not always consider itself a "community" (see Appendix C). A larger set of researchers use the knowledge about E. coli and other model organisms to inform their understanding of other organisms or to study species-independent general properties of living systems. A growing area within the broader E. coli community includes systems biologists who seek to model the integration of fundamental life processes. Thus, the larger E. coli "research community" is immense and basically encompasses the entire biomedical research community.

      What should a web-based E. coli MOD initiative do?

      • Capture the literature and experience for the full community - ranging from new students, investigators, and those looking to extrapolate E. coli information to other systems, to core E. coli research groups, to systems biologists and bioinformaticians seeking to extract insights from this body of information.
      • Provide access to high throughput data and structural data and allow comparisons between results from high throughput data done in different laboratories.
      • Identify anomalies - disconnects that may suggest missing pathways, missing genes (particularly in intergenic regions), missing understanding of the biology.
      • Create the hub of an intellectual community of those using E. coli for different reasons, hastening our ability to synthesize information on this organism.
      • Provide an essential platform for the development of genome databases for other bacteria, both by providing a template for other organisms and by providing a scholarly, curated annotation of genes in this major reference genome for comparative genomics.

    2. Current E. coli databases and other MODs

      Thirty-four separate web-based E. coli databases are listed in the "E. coli Database Portal" (http://www.uni-giessen.de/~gx1052/IECA/ieca.html), which has attempted to unify links to web resources for E. coli. Many have been created without dedicated funding or have suffered from lapses in funding at critical points in their development. Many have a relatively narrow focus related to the primary research focus of one lab (proteases, transporters, repeated sequences, etc.). Appendix B summarizes what we could determine about the mission and status of each of the databases in the E. coli portal. Many of the desired functionalities for an E. coli web-based MOD system described above are partially or fully implemented in one or more of the existing databases. However, unification and integration between databases are limited, and some aspects are mostly or entirely missing.

      Several E. coli experimentalists, including members of this panel, have argued that E. coli web resources are missing key components that are available for other MODs, most notably SGD, the MOD for Saccharomyces cerevisiae. Appendix A compares some functions that are available in SGD and different E. coli on-line databases.

      Although some of the existing databases aim to meet some of the goals discussed above, meeting all of these needs is beyond the stated scope of any of their current missions and coverage of key material is variable. While the information and analysis in these databases need to be integrated into any broader effort, the full range of broader needs cannot be met without adding significant new functionalities to what already exists.

      What is needed and currently missing?

      1. An agreed-on nomenclature and annotation of the genome, with all alternative names, to allow investigators to easily link studies that refer to the same gene by different names. This process has begun on an informal basis under a group headed by M. Riley, but it will need to be completed and a process for updating incorporated. NCBI specifically requests a committee to serve as the interface between them and the annotation efforts. This needs to be developed for longer-term updates. If an E. coli bioinformatics initiative is funded, it should include this as an essential element.

      2. Literature citations (and links), evidence codes to support annotations, and short text descriptions of known biological functions are clearly desirable but should be considered a continuing task rather than something that needs to be achieved immediately. While some current databases do this, the coverage is far from complete and does not usually provide information on mutant phenotypes that may provide the critical information for deducing new functions. Linkage to other efforts (the E. coli/Salmonella on-line book, from ASM for example) may be a way to achieve some aspects of more complex annotation. Methods of effectively involving the community in continued updates should be explored.

      3. Not clearly delineated in most currently available databases are elements other than genes and operons: cryptic prophages, repeated sequences, small RNAs. This could presumably be added reasonably easily to existing gene and operon-based databases, although definition of some of these elements is not obvious and would need ongoing annotation decisions.

      4. It is currently very difficult to extract from most databases full information on the genomic environment: Start and stop points of transcription, and therefore size and sequence of 5' or 3' UTRs, distance and orientation relative to neighboring genes, other genes in operons. Extracting this information from the literature is currently a challenge.

      5. No current database provides a general repository for high-throughput data. This needs to begin with microarray experiments from different laboratories, hopefully from multiple platforms, published in the literature and with the full datasets made available. It should be possible to query the repository by gene and, hopefully, by other parameters (expression under a given condition, for instance). From a given experiment, it should be possible to find the relevant annotation information about a given gene easily. This is a high priority if full use is to be made of the rapidly accumulating data. Eventually other sorts of high throughput data should be included in this database.

      6. E. coli (and other microbes) need to be considered in the context of evolution; this helps with identification of function and pathways.
        1. Evidence of the homologs in other organisms; this is different in intent from identification of the broad family, which may have many members in a given organism, but is based on trying to find the closest relatives/same function. Where are they found? What other genes are there?
        2. Genome organization around a given gene in E. coli and other organisms.
        3. A longer term goal should be information on conservation of sequences corresponding to regulatory sites and non-coding regions. A simple link to the DNA (with flanking regions) as well as protein sequences of the homologs would make this much easier.

      7. One-stop shopping. Although many databases cover some aspects of the necessary data, very few scientists are likely to know where to find what. For many scientists, this is one of the major advantages of an integrated MOD for E. coli; investigators can learn to use and query one database rather than a changing multiplicity of them.
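
      Items 1 and 4 above (alternative gene names and genomic environment) can be sketched in a few lines. The following Python sketch is only an illustration of the kind of query an integrated resource might support; the class and field names, the gene names, and the coordinates are all hypothetical and are not taken from any existing E. coli database. For simplicity it treats the chromosome as linear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """One annotated gene; every field name here is an illustrative assumption."""
    primary_name: str
    start: int            # leftmost coordinate, 1-based
    end: int              # rightmost coordinate, inclusive
    strand: str           # "+" or "-"
    synonyms: tuple = ()  # alternative names used in the literature


class GeneIndex:
    """Resolves any known name to one record and reports adjacent genes."""

    def __init__(self, genes):
        self.genes = sorted(genes, key=lambda g: g.start)
        self.by_name = {}
        for gene in self.genes:
            for name in (gene.primary_name, *gene.synonyms):
                self.by_name[name.lower()] = gene

    def resolve(self, name):
        """Return the gene for a primary or alternative name, or None."""
        return self.by_name.get(name.lower())

    def neighbors(self, name):
        """Return adjacent genes as (name, intergenic gap in bp, same strand?)."""
        gene = self.resolve(name)
        if gene is None:
            raise KeyError(name)
        i = self.genes.index(gene)
        result = {}
        if i > 0:
            left = self.genes[i - 1]
            result["left"] = (left.primary_name,
                              gene.start - left.end - 1,
                              left.strand == gene.strand)
        if i + 1 < len(self.genes):
            right = self.genes[i + 1]
            result["right"] = (right.primary_name,
                               right.start - gene.end - 1,
                               right.strand == gene.strand)
        return result


# Made-up records, for illustration only.
index = GeneIndex([
    Gene("genA", 100, 400, "+"),
    Gene("genB", 450, 900, "+", synonyms=("oldB",)),
    Gene("genC", 1000, 1300, "-"),
])

print(index.resolve("oldB").primary_name)
print(index.neighbors("oldB"))
```

      A lookup by the older name returns the same record as a lookup by the current name, so studies that use different names can be linked through a single index.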

      Should the proposed database(s) be specific to E. coli or more general for all bacteria?

      Of the major categories of information described above, some need to be specific to a given organism (E. coli K-12 in this case), while others will be better achieved by a broader approach.

      1. Annotations of genes with respect to transcription and translation starts and stops, prophages and other elements will clearly be specific to a given genome. Regulatory circuits and information on regulation may or may not be specific.

      2. Citations and predicted gene function and organization will depend upon both E. coli-specific and general information. A database that identified protein homologs throughout bacterial species, and that indicated which of them have experimental information and of what sort, would be useful for all bacterial genes.

      3. High throughput data needs to be compiled in an organism-specific form, at least in the near future, both for useful comparisons between different experiments and for correlation with the genome organization (operons), particularly for genes of currently unknown function.

      4. Evolutionary considerations depend, of course, on the ability to make comparisons with other bacterial genomes.

      Specific examples of what cannot be done with the currently available information (or what is difficult to do).

      In order to investigate the available information for either of the fairly common experiments below, the scientist will have to move between various databases, some of which may use different gene names, different methods for describing a gene, and which may or may not provide the known transcriptional regulatory sites. Finding genome context and relatives in other organisms is even harder.

      1. We run a microarray experiment, comparing expression in our favorite mutant (or overexpression condition) to wild-type. Some genes go up; others go down. The effects may be direct or indirect. To decide what experiments to do next, we need some hints to help us organize a reasonable hypothesis. Investigating these genes through the primary literature, and even through the currently available databases, assuming we know they exist and what is in them, will take a very long time.

        1. What is known about the genes we see changing?
          1. Are they in a known related pathway or part of a known regulon?
          2. If they are known genes, what are the phenotypes of mutations in them, and does that relate at all to what we are studying?
          3. Are some of them in operons, and how good is the data supporting the operon and its regulation? If they are in operons, we can ask if the operon is being regulated as a block or something unexpected is going on in our experiment.

        2. How have they behaved in other array experiments? If one was affected in a given condition, were the others? Does this suggest a new regulon?

      2. We isolate a new mutation or clone in a gene of unknown function that has a phenotype.

        1. What has previously been found about this gene? Hints for function come in many forms.
          1. Phenotypes in previous work? As a mutant or overexpression?
          2. Any interaction data for mutants?
          3. Expression under various conditions in array experiments?
          4. Regulatory sites? Confirmed or predicted?
          5. Related proteins in this and other organisms?

        2. What is the genomic environment of this gene? This information is important to look for types of regulation and regulatory sites.
          1. Other genes in the same operon? If so, what is known about them?
          2. Conserved sequences in leader, promoter, downstream?
          3. Distances to other genes?
          4. Is the genome environment conserved in other bacteria?

        3. What has previously been found for this protein/family of proteins?
          1. Biochemical data
          2. Localization data
          3. Interaction data
          4. Structural data
          5. Predicted related proteins; how many are based on data as opposed to other predictions?
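
      The questions in the first example (is an operon regulated as a block, and which genes changed together in other array experiments) could be answered mechanically if operon annotations and expression calls were held in one place. The sketch below is a minimal illustration under that assumption; the operon names, gene names, experiment names, and the three-valued "up"/"down"/"none" calls are all hypothetical.

```python
from collections import Counter


def operon_status(operons, calls):
    """Classify each operon for one experiment.

    operons: operon name -> list of member genes
    calls:   gene -> "up", "down", or "none" (genes absent from calls count as "none")
    """
    status = {}
    for operon, members in operons.items():
        observed = {calls.get(gene, "none") for gene in members}
        if observed == {"none"}:
            status[operon] = "unchanged"
        elif len(observed) == 1:
            # Every member moved in the same direction.
            status[operon] = "block:" + observed.pop()
        else:
            status[operon] = "mixed"
    return status


def co_changing_genes(experiments, gene):
    """Count how often other genes changed in experiments where `gene` changed."""
    counts = Counter()
    for calls in experiments.values():
        if calls.get(gene, "none") == "none":
            continue
        for other, call in calls.items():
            if other != gene and call != "none":
                counts[other] += 1
    return counts


# Made-up annotations and results, for illustration only.
operons = {"opX": ["x1", "x2", "x3"], "opY": ["y1", "y2"]}
experiments = {
    "exp1": {"x1": "up", "x2": "up", "x3": "up", "y1": "down"},
    "exp2": {"x1": "up", "y1": "down", "y2": "down"},
}

for name, calls in experiments.items():
    print(name, operon_status(operons, calls))
print(co_changing_genes(experiments, "x1").most_common())
```

      A "mixed" result flags an operon whose members did not respond together, which is the "something unexpected" case described above; genes that repeatedly change alongside the query gene are candidates for a shared regulon.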


  II. Proposed New Initiative

    Organization: In a meeting a year ago and in many emails as well as a conference call, groups currently involved in the major E. coli databases were consulted with the hope of achieving a single consensus for how to best move towards the goals outlined above. No such consensus was reached. Those already heavily invested in the development of separate E. coli databases may well have some of the best ideas about how to proceed, but convincing others with other stakes is difficult. Therefore, a call for proposals that summarizes what is needed rather than exactly how to get there may elicit the most useful responses.

    The FTE estimates below were provided primarily by P. Karp (developer of EcoCyc); comparison to SGD (information provided by M. Cherry) is included where appropriate.

    Four possible models have emerged from the discussions. The first three are focused on E. coli; the fourth assumes a broader bacterial initiative, with some E. coli-specific components. All depend on the most accurate annotation of the genes of E. coli (and other bacterial organisms, to the extent they are integrated in this).

    Essential Characteristics of any proposal:

    • Downloadable datasets to allow more computationally-oriented users to manipulate and analyze the data in ways that many users may not find necessary. Thus, while data should be easily accessible on a web site, it should also be downloadable.
    • Programs should be mirrorable - able to be exported easily and maintained in other sites in other countries. Both the data and the tools for analyzing it will be of interest internationally.
    • Whatever is developed should be accessible to those interested in related organisms and should provide a framework for parallel efforts for other organisms.
    • Outreach and Community Interaction Component. Whatever is developed needs to become an integrated part of the research effort, and therefore should include the development of clear documentation and training, as needed, both for using the site and for providing annotations by the research community to the site. To ensure that the database remains oriented towards the user community, an advisory committee of users should be developed. Such an advisory committee could also consider the development of recommendations to journals for encouraging release of high throughput data and possibly annotation information to a site, akin to the deposit of structures and sequences now in place.

    Model A: Minimal change from current efforts:

    Develop a committee of E. coli users and annotators who would interact with organizers of current databases to help with annotation by finding appropriate experts, would agree on new versions of genome to pass on to NCBI, and would develop future requests for funded bioinformatics requirements. This assumes continued support of the major current databases, and their cooperation in this process. If one of these were to fail to get continued funding, this might leave a major hole in what is available. If such a committee were to be developed independently of the advisory committees for the current databases, some support (1-2 FTEs) would be needed for coordination and oversight. In addition, the cost of periodic meetings of the committee should be included ($100,000/meeting, initial meeting and then possibly once per year).

    Hopefully associated with this would be a request for a new effort for collection of and community access to high throughput data, as well as funds to link this effort to current major databases. Estimated cost is 4-6 FTE/year (programmers, curators, system administrator) to develop this, link it to the databases, and enter datasets. This level of funding would support a large variety of high-throughput data. A much smaller effort (2 FTE?) might be sufficient to establish a site and begin the business of collecting array data. Note that SGD has only a limited number of array data experiments available on their site.

    Advantages: Takes the most advantage of currently supported databases; probably the least expensive option.

    Disadvantages: Does not provide any central integration, and does not provide resources to improve coordination. Ties the community to what grew up from individual interests, rather than the best overall approach. Does not provide any guarantee of continuity for individual pieces of the process. Most importantly, does not provide a useful template for other organisms.

    Model B: Integrated Database Warehouse:

    This model integrates data from a "Knowledge base" with basic annotation information, an "experimental base" (the depository for high throughput data), and possibly a strain database as well. This proposal is basically that from P. Karp. In his proposal, this requires 3-5 FTEs/year to provide a central database management system and to provide the integration of all available relevant large databases in addition to the specific E. coli ones that already exist. In addition, the Knowledge base, which would provide curated information on genes, proteins, regulators, and metabolic pathways, would require 6-8 FTE/year. Presumably EcoCyc and RegulonDB, as well as EcoGene, currently do much of this; to what degree this requires new funding is unclear.

    Theoretically the Integrator site could be the same as either the Knowledge database or the high throughput database or could include everything, incorporating rather than integrating the current databases. It could also be a central hub that would subcontract to specific databases as needed for support of interoperability and some operating expenses.

    This is probably most akin to SGD. Currently SGD has 8 FTE curators, 2 FTE programmers, 1 system administrator, and a 0.6 FTE database person, plus 3.5 additional FTEs associated with the Gene Ontology project.
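
    The staffing figures quoted for Models A and B and for SGD can be tallied for a rough side-by-side comparison. The short Python sketch below simply adds up the numbers given in the text; the component labels are paraphrases, and the totals rest on the assumption that the quoted components are additive. It leaves out the meeting costs and the smaller 2-FTE alternative mentioned under Model A, and it does not account for how much of Model B is already covered by existing funding, which the text notes is unclear.

```python
def total_range(components):
    """Sum (low, high) FTE estimates across the components of one model."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


# FTE/year ranges as quoted in the text for Models A and B.
model_a = {
    "committee coordination and oversight": (1, 2),
    "high-throughput data effort": (4, 6),
}
model_b = {
    "central database management and integration": (3, 5),
    "knowledge base curation": (6, 8),
}

# SGD staffing as quoted in the text (single values, not ranges).
sgd_core = {
    "curators": 8,
    "programmers": 2,
    "system administrator": 1,
    "database person": 0.6,
}
sgd_gene_ontology = 3.5

sgd_core_total = sum(sgd_core.values())
sgd_total = sgd_core_total + sgd_gene_ontology

for label, model in (("Model A", model_a), ("Model B", model_b)):
    low, high = total_range(model)
    print(f"{label}: {low}-{high} FTE/year")
print(f"SGD core staff: {sgd_core_total:.1f} FTE")
print(f"SGD including Gene Ontology: {sgd_total:.1f} FTE")
```

    Under these assumptions, Model A comes to 5-8 FTE/year and Model B to 9-13 FTE/year, a range that brackets SGD's core staffing of 11.6 FTE (15.1 FTE with the Gene Ontology positions).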

    Model C: Federated Integration through a common website:

    This is a variation in which there is an "integrator" that stands outside all current major databases, but that is specifically asked to integrate data from the current major E. coli databases. I believe this is what K. Rudd has been suggesting. Funding would presumably be similar to that proposed for the warehouse, with additional funds to the database components to help them integrate. A successful proposal would have to include the agreement of the appropriate component databases.

    Model D: Integrated General Bacterial Database Warehouse:

    This model was not discussed as broadly, and therefore estimates of resources are not available. The central database in this case would focus on bacteria in general rather than specific genomes, but would by necessity have at its core the experimental data from the E. coli system. Associated with such a general database would be the development of a template for accumulating and querying high throughput data from reviewed papers; development of organism-specific databases could then be fairly simply added to this.

Appendices:

Appendix A: Comparison of SGD and current E. coli databases

Feature | SGD implementation | E. coli equivalent(s) and comments

Gene/Protein-level information
Names and Identifiers | Includes a widely accepted gene name registry for prepublication registration of new names and policies for naming and name conflict resolution. Locus search will find by alternate names. | CGSC and EcoGene were responsible for the most recent (1998) genetic maps. Online registration not available and would be useful for renaming of y genes. Many of the existing databases will find by alternate names.
Basic gene product annotation | from GO function | Mixed usages in different databases.
DNA and protein sequence retrieval | yes | EcoCyc, Colibri, others
GO Annotations | yes | EcoCyc uses its own searchable ontology; ASAP uses GO and multifun; CyberCell CCDB colicards (locus pages) list GO, Riley, and Blattner annotations
Mutant Phenotypes | Extensive with references | Essential/nonessential in Cybercell CCDB
Homologs | SGD has precalculated: psi-blast results and taxonomy distributions; model organism (not including E. coli) BLASTp hits; CLUSTALW within fungi | Cybercell CCDB gives model organism orthologs. ClustalW alignments available via COGs links in several databases.
Protein Info (physical properties, transcript info) | N-term and C-term peptides; calculated MW, pI, aa content, codon adaptation index | Cybercell CCDB has most of these. EcoCyc coverage is not uniform (manually curated)
PDB Homologs (protein structure info) | table with links to alignment | Graphical view of alignments in Doodle
Motifs | yes, from various databases. Graphical display along gene | EcoCyc - manual curation as links; Cybercell CCDB - lists Prosite motifs; Doodle - graphic display of interpro motifs and predicted coiled-coils.
Genome-wide Expression (and other large-scale analyses) | Expression connection allows three kinds of analysis from a variety of published high-throughput datasets. | EcoCyc provides visualization tools, but does not archive high-throughput data; ASAP includes UW microarrays; Cybercell has 2-D gels; KEGG has Mori (2001) datasets; OU Microarray core posts datasets. Analysis across datasets seems to be unavailable.
Interactions | yes; both physical and genetic | EcoCyc, Cybercell CCDB.
Localization | yes | EcoCyc, others
Community Annotation | yes | ASAP
Literature Guide | Curated papers associated with genes are associated with category keywords and list other genes mentioned | EcoCyc - varies. Some lists with links. Others more extensively commented; EcoGene - lists with links

Genome level info
Genome map browser | yes | EcoCyc, coliBase, Cybercell CCDB, Colibri, Doodle (gbrowse), others
Genome wide motif query | yes | only via keywords?
Operons | not applicable | RegulonDB/EcoCyc

Metabolic and signaling pathways, large complexes
Finding components | via GO | EcoCyc
Pathway display | broken link to BioCyc | EcoCyc and KEGG
Enzymatic properties | ? | Cybercell CCDB lists Km and kcat

Analysis tools
BLAST/FASTA | both | BLAST at EcoCyc, Colibri, others?
Restriction maps | yes | Doodle, using gbrowse, others?
Primer design | User enters a gene name and parameters. Site finds primers | Not available?

Community resources
Expertise database | yes | Not available?
Meeting announcements | yes | Not available?

Appendix B: currently available databases

Of 34 sites listed in the E. coli Portal, most are specific to K-12, 4 to pathogenic strains of E. coli, and the rest to multiple organisms. Text describing these sites was generally taken from descriptions provided for the E. coli portal, with minor editing.

Escherichia coli K-12 Specific databases:

US databases:

EcoCyc:
(links to multiple organisms). EcoCyc is a model-organism database for E. coli. The EcoCyc team performs literature-based curation of the following information: the E. coli genome, metabolic pathways, transporters, and genetic regulatory network. Update frequency: once every 3 months; EcoCyc is funded from the NIH National Center for Research Resources.

EcoCyc has incorporated aspects of the following databases:

RegulonDB
RegulonDB is a database on regulation of transcription initiation (promoters, sites, etc) as well as operon organization and transcriptional regulators. It contains information gathered continually from the literature and computational predictions that encompass the complete genome. Update frequency: continuously. Based in Mexico: CIFN/UNAM, Av. Universidad s/n, Col. Chamilpa, Cuernavaca, Morelos. 62210, Mexico.

GenProtEC
GenProtEC offers information on E. coli K-12 gene products, gathered from both earlier and current literature, from analysis of sequence similarities, from analysis of biochemical and structural protein families, and from identification of gene fusions of independent functions. In addition, members of paralogous families are identified. New types of data are being incorporated. Update frequency: monthly.

TransportDB
The distribution of known and putative polytopic cytoplasmic membrane transport proteins was determined bioinformatically for all organisms for which completely sequenced genomes were available. Transport systems for each organism were classified according to: 1) putative membrane topology, 2) protein family, 3) bioenergetics, and 4) substrate specificities. The overall transport capabilities of each organism were thereby estimated. The number of transporters identified in each organism varied dramatically, but was approximately proportional to genome size. Complete lists of the transporters from each organism are provided from the pull down menus on this page. (UCSD)

E. coli Genome project: U. Wisconsin; sequencing of K12 and others, annotations. The initial genome project was funded by the NIH Human Genome Project, precursor to the National Human Genome Research Institute (NHGRI). Our current bacterial pathogens studies are funded by the National Institute of Allergy and Infectious Diseases (NIAID). Our functional genomics studies are funded by the National Institute of General Medical Sciences (NIGMS).

ASAP: ASAP (a systematic annotation package for community analysis of genomes) is a relational database and web interface developed to store, update and distribute genome sequence data and gene expression data collected by or in collaboration with researchers at the University of Wisconsin - Madison.
Update frequency: continuous.

CGSC: E. coli Genetic Stock Center
The CGSC Database of E. coli genetic information includes genotypes and reference information for the strains in the E. coli Genetic Stock Center (CGSC) collection, gene names, properties, and linkage map, gene product information, and information on specific mutations. The public version of the database includes this information and is accessible in two forms: 1) The CGSC DB_WebServer provides a fill-in-the-blank form that results in direct querying of the database. 2) A direct login to our Sybase APT forms frontend provides somewhat more powerful, but less convenient, query capabilities. The CGSC Collection consists primarily of genetic derivatives of E. coli K-12, the non-pathogenic laboratory strain used in genetic and molecular studies, and includes combinations of 2-29 mutations from among 3500 mutations at more than 1000 different loci. These strains are provided to the community by request, through contact information given on the web site. Update frequency: unknown

E. coli Transcription factor binding sites: This site presents transcription factor binding site predictions in the E. coli genome made by cross-species comparison (i.e. phylogenetic footprinting) using a Gibbs sampling algorithm for motif finding. Predictions were made upstream of 2086 E. coli genes; that is, all genes for which: 1) there was at least 50 bp upstream intergenic sequence, and 2) a probable ortholog was identified among the species used for comparison. The gene names and annotations used are those from the E. coli genome GenBank entry (U00096). Where available, correspondence of our predictions with experimentally verified binding sites or known repeats is presented. Update frequency: unknown. Wadsworth labs, Albany, NY.

OU MCF: The OU (University of Oklahoma) Microarray Core Facility website http://www.ou.edu/microarray offers a comprehensive database of E. coli gene expression data complete with detailed experimental details and protocols, as well as tools for semi-automated data analysis in spreadsheets. The E. coli experiment sets include standard and diauxic growth curves, oxidative stress adaptation and recovery, response to altered acetyl phosphate pools, and analysis of the GadXW acid tolerance regulon. Update frequency: daily

EcoReg: The first goal of the EcoReg Consortium is to compile quantitative and genetic data into the database, and to provide a query interface. A fairly simple example query is "Find all data relating to the role of Lrp in pathogenesis". Update frequency: unknown

US: Pathogenesis:

STEC: The STEC Center based at The National Food Safety and Toxicology Center at Michigan State University is designed to facilitate research on the Shiga-toxin producing Escherichia coli by providing a standard reference collection of well-characterized strains and a central on-line accessible database. The STEC Center was established to: 1) Act as a repository for deposition of STEC from new outbreaks and environments as they are identified. 2) Establish and distribute sets of STEC reference strains for use by investigators. 3) Conduct rapid characterization of STEC based on genetic markers of clonal identity and virulence genes; sequencing of flagellin and toxin genes will be performed in order to subtype strains. 4) Develop and maintain an on-line database that integrates new data on the STEC strain set that would be available for data input by collaborating researchers. Update frequency: unknown (MSU)

ECOR / DEC: A major research effort in the Microbial Evolution Laboratory (Michigan State University) is the study of the evolution of pathogenic forms of E. coli associated with intestinal and extra-intestinal infections. Through the analysis of molecular polymorphisms, we are testing evolutionary hypotheses about the major genetic events leading to the origin of new pathogens, such as E. coli O157:H7, and emerging infectious diseases. Update frequency: unknown

US: NIH

NCBI: Escherichia coli K12, complete genome: The NCBI Genomes web site provides computational tools for better visualization of the genome sequence and the properties of predicted proteins. Feature tables list the DNA coordinates of the structural RNA- and protein-coding genes, their names, and the direction of transcription, and allow retrieval of the corresponding DNA fragments. In addition, the Protein feature table lists UWisconsin b-numbers (where available), predicted protein functions, and the corresponding protein family entries in the COG database. TaxMap shows the distribution of the genes with closest homologs among Bacteria, Archaea, or Eukaryotes and allows easy visualization of regions most likely derived by interkingdom gene transfer. TaxPlot allows a three-way comparison of E. coli genes with genes from any other sequenced genomes, which is useful for comparing the relative proximity of E. coli to other bacterial genomes and for identifying genes that have particularly close homologs in genomes of other species. This approach often allows identification of recently acquired foreign genes that can be involved in pathogenicity. Update frequency: daily

NCBI COG: The NCBI COG page shows protein families (COGs) that are encoded (or not) in the genomes of E. coli strains K12 and O157. Update frequency: ~annually

International

European databases:

E. coli Index:
Echobase: Echobase aims to provide functional predictions for uncharacterised E. coli genes (y genes) by integrating information from a range of post-genomic experimental and bioinformatic data. Update frequency: unknown (York, UK).

coliBASE: coliBASE is a database for comparative genomics of E. coli, Shigella, and Salmonella. Unlike other online E. coli resources, coliBASE attempts to represent the full diversity of E. coli and related organisms, with a particular focus on pathogenicity. The database includes information from all the available E. coli, Shigella, and Salmonella genomes (finished and unfinished), and provides novel visualization and analysis tools, including a viewer that allows rapid comparisons between homologous regions from different strains to identify insertions, deletions, etc. Update frequency: sequence data are regularly updated (University of Birmingham, UK).

ECDC: E. coli database collection. Germany; last update Feb., 2002. Updates irregular. ECDC is searchable in different ways: by gene name, by sequence map position (traditional min map), by scrolling different tables, and by a provisional keyword search. Several tools help to edit any DNA sequence. Links are provided to other databases, such as Swiss-Prot and the Brookhaven PDB databank. The Escherichia coli database collection is maintained at the Justus-Liebig-University at Giessen (Germany). ECDC is supported by the German Research Council (DFG).

Colibri: (last update: May 4, 1999; maintained at the Institut Pasteur) Colibri is a database dedicated to the analysis of the genome of Escherichia coli. Its purpose is to collate and integrate various aspects of the genomic information from E. coli. Colibri provides a complete dataset of DNA and protein sequences derived from the paradigm strain E. coli K-12, linked to the relevant annotations and functional assignments. It allows one to easily browse through these data and retrieve information, using various criteria (gene names, location, keywords, etc.) and sequence analysis tools (pattern search, BLAST, etc.). The data contained in Colibri originate from two major sources of information: 1) The reference genomic DNA sequence from the E. coli Genome Project. 2) The feature annotations from the EcoGene data collection provided by Kenn Rudd.

Phydbac: Phylogenomic display of bacterial genomes. Second release: (01/02/2004)
Created by a PhD student in Cherbourg, France.
Phydbac provides the precomputed phylogenomic (co-evolution) profiles of all Escherichia coli proteins. Described in: Enault F., Suhre K., Abergel C., Poirot O., and Claverie J.-M. (2003) A new approach to the annotation of bacterial genomes using phylogenomic profiles. ISMB 2003 (in press). Update frequency: unknown

MRC: Cambridge, UK: Transcription factors play a central role in an organism by responding to the various stimuli in the external environment and regulating the expression of a specific set of genes. In a simple organism like E. coli, we have identified 271 transcription factors and have studied the protein families and the domain architecture of the transcription factors. The supplementary material contains information about the individual transcription factors which have been identified, with searchable information about the domain architecture, protein function, etc. Update frequency: unknown

EPD: E. coli protease database. The E. coli protease database lists and classifies 72 proteases according to their cellular localization and protease family. There are links to the Swiss-Prot, InterPro, and PDB databases, and topological models are included for the cytoplasmic membrane proteins. (UK)

The E. coli Cell Envelope Protein Data Collection (http://www.cf.ac.uk/biosi/staff/ehrmann/tools/ecce/ecce.htm) attempts to list all cell envelope proteins. Entries are classified according to their cellular localization (cytoplasmic membrane, periplasm, and outer membrane) and function. Topological models are presented for the 896 cytoplasmic membrane proteins, and there are links to the relevant Swiss-Prot and PDB entries. There are separate lists of 452 signal sequences and of periplasmic and outer membrane proteins sorted according to their sizes. Update frequency: unknown (UK).

PMTG: Unit of Molecular Programming and Genetic Toxicology (UMPGT), Institut Pasteur; short palindromes and repeated sequences.

ABCISSE: Phylogenetic and functional classification of ABC transporter systems found in living organisms. E. coli information may be obtained directly using the list of organisms. In addition, keyword-based searches are provided by gene name, functional category, substrate type, or biological role. Update frequency: variable, roughly every 6 months (Institut Pasteur)

BIGS: Bacterial Targets, IGS, CNRS: Information on conserved genes for possible antibacterial targets; Information Génomique et Structurale / Structural and Genomic Information, Marseille, France

Canada

Montreal-Kingston Bacterial Structural Genomics Institute Database:
(restricted login); Canadian National Research Council Support. This database contains structural information about the E. coli genome(s). It shows the progress of the lab's structural proteomics project and tracks the progress of other worldwide structural projects as well as deposits in PDB. Update frequency: weekly for tracking.

CyberCell: The CyberCell Database (CCDB) was designed to coordinate both the back-filling and on-going experimental studies being conducted on E. coli. The CCDB is a comprehensive, Web-accessible repository bringing together both observed and derived quantitative data covering most aspects of the genomic, proteomic, and metabonomic characteristics of E. coli. The database self-updates and supports an extensive list of querying and search options. This includes a powerful, easy-to-use relational data extraction system. The CCDB is a composite of four browsable databases: a. The main CyberCell database (CCDB - containing gene and protein information, including figures of 2D gels); b. The 3D structure database (CC3D - containing information for structural proteomics); c. The RNA database (CCRD - containing tRNA and rRNA information); d. The metabolite database (CCMD - containing metabolite information). The supporting utilities for CCDB are written in Perl, HTML, and JavaScript. They are available as a web version or as flat files. Update frequency: monthly (CCDB is an integral part of Project CyberCell, a large-scale multi-lab project based out of the University of Alberta.)

Japan

GIB: Genome Information Broker. The Center for Information Biology and DNA Data Bank of Japan (CIB) is a division of the National Institute of Genetics (NIG), Mishima, Japan. CIB consists of four research laboratories: the Laboratory for DNA Data Analysis, the Laboratory for Gene Function Research, the Laboratory for Gene-Product Informatics, and the Laboratory for Research and Development of Biological Databases. CIB also operates the DNA Data Bank of Japan (DDBJ), a member of the International Nucleotide Sequence Database, DDBJ/EMBL/GenBank. GIB is a part of the comparative genome project at DDBJ and provides interactive maps for a variety of organisms.

Sakai/O157: The Genome Information Research Center (GIRC) database at Osaka University (Japan) provides a variety of genome links. This page, however, deals exclusively with Escherichia coli O157:H7, a major food-borne infectious pathogen that causes diarrhea, haemorrhagic colitis, and haemolytic uremic syndrome. We have determined the genome sequence of an O157:H7 strain isolated from the Sakai outbreak, which occurred in 1996 in Sakai City, Osaka Prefecture, Japan. All genome-related data, together with links to the respective publications, are given. Update frequency: irregular (link not functional).

GenoBase: The scope of GenoBase is a comprehensive understanding of the living-cell system of Escherichia coli K-12 (W3110). GenoBase is the public repository for sequence information, proteome, transcriptome, and bioinformatics data, for literature-based knowledge concerning E. coli, and for resource information such as archive clones, disruption mutants, and deletion mutants.

KEGG: KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/REACTION databases). The KEGG system is an attempt to uncover and utilize cellular functions through reconstruction of protein interaction networks from the genome information, including gene expression profiles both at the mRNA and protein levels and a catalog of protein 3D structure families. Update frequency: daily (Kyoto)

PEC: The Profiling of Escherichia coli chromosome (PEC) database has been constructed to compile any relevant information that could help to characterize the E. coli genome, especially with respect to discovering the function of each gene. The database is intended to provide an interface comprehensible to most experimental researchers. PEC is based on the sequence information of E. coli strain MG1655. Basic information about each gene (gene name, direction, length, location, etc.) was retrieved from other databases and annotated before incorporation into the PEC database. The genome is displayed as a circle or linearly. Structural domains and motifs of a gene product are displayed graphically along with those of other genes having the same domains or motifs. Update frequency: daily

Appendix C: Process of soliciting interest in this proposal: querying and creating the E. coli community

All of these unique and important aspects of E. coli biology can best be taken advantage of via a sophisticated and complete bioinformatics platform that serves the needs of the users. Why hasn't this already been done? Much of the answer lies in how basic E. coli has been. As pointed out by one scientist, the use of E. coli is so widespread and the subject is so mature that the community frequently does not identify with the organism but with the process under study. Thus, the E. coli community is not as easily defined as the community of zebrafish workers, or C. elegans workers, or even S. cerevisiae workers, and the users of the proposed bioinformatics resources will extend well beyond those who work on E. coli per se.

No E. coli meeting exists, making it more difficult to draw up a list of those working on this organism: a scientist working on the secretion system of E. coli may attend meetings on protein localization (in prokaryotes and eukaryotes), while a scientist working on the transcription of a given gene in E. coli may attend meetings on transcription mechanisms, in bacteria and other organisms. While there are a number of meetings that emphasize basic processes in bacteria, dominated by studies on E. coli K-12, the bioinformatics resources will provide a particularly important mechanism for creating an intellectual community, in which information from specific fields can be accessed by others interested in understanding how their own work fits with what has gone before. Thus, querying the many possible users about what is needed and advertising the availability of what is created will both be important. Over the last year, information about this initiative was distributed via a number of mechanisms, and an invitation to express interest and concerns was publicized. A summary of those efforts is given below, but it is clear that scientists in a wide range of specific fields would find a consolidated entree to E. coli information extremely useful.

History of the effort thus far:

  1. Much of the impetus for this document developed as a result of a series of White Papers developed by the E. coli consortium. Text and ideas from the Bioinformatics White Paper have informed, and in part been incorporated into, the current document. The White Paper was considered by the NIH Model Genome Group, and, on the basis of their comments, a workshop on E. coli Bioinformatics needs was held in March, 2003. Attendees at that workshop are listed below.

  2. A document describing the outcome of the workshop (abbreviated minutes and summary) was written and commented on by attendees.

  3. Based on the White Paper, the workshop and summary, and available resources, a document was developed with input from members who attended the workshop and others.

  4. An E. coli web site for registering interest in this initiative and distributing the document was developed (www.Ecoli.Princeton.edu). The availability of the web site and a brief description of the issues under consideration were advertised by flyer and oral presentation at ASM Divisional business meetings, the FASEB prokaryotic transcription meeting, and the Meeting on Molecular Genetics of Bacteria and Phage. In addition, ASM emailed a notice about the web site and the initiative to relevant sections of their membership who have agreed to receive ASM emails.

  5. Responses and comments received are summarized in Appendix A. There are currently 463 registered users for the E. coli web site, and 180 votes in a poll of interests. Summaries of the voting are also available in Appendix A; fully 97% of those voting expected to use an integrated bioinformatics site.

  6. A conference call and email correspondence on the specific form that an appropriate effort should take was carried on during April, 2004.

  7. Based on comments on the revised document as well as discussions on the need to specifically address points in the NIH document on "Process for Considering Support for Non-Mammalian Models", the current document was developed.

Appendix D: Attendees at the E. coli Bioinformatics Resources Workshop

Bethesda, Maryland, March 24-25, 2003

Frederick R. Blattner, Ph.D.
University of Wisconsin-Madison

Michael Cherry, Ph.D.
Stanford University

Jonathan Eisen, Ph.D.
Institute for Genomic Research

Susan Gottesman, Ph.D.
National Institutes of Health

Michael Gribskov, Ph.D.
San Diego Supercomputer Center

Scott Hultgren, Ph.D.
Washington University School of Medicine

Robert Kadner, Ph.D.
University of Virginia

Peter Karp, Ph.D.
SRI International

Margaret Ann Riley, Ph.D.
Yale University

Monica Riley, Ph.D.
The Marine Biological Laboratory

Kenneth Rudd, Ph.D.
University of Miami School of Medicine

Thomas J. Silhavy, Ph.D.
Princeton University

Barry Wanner, Ph.D.
Purdue University

Ryland Young, Ph.D.
Texas A&M University