Published in Probe Volume 2(1): Spring 1992
Drs. Sam Cartinhour, J. Michael Cherry, S. Hanley,
B. Hauge, and Howard Goodman
Department of Molecular Biology
Massachusetts General Hospital, and
Department of Genetics, Harvard
Medical School, Boston, MA
A specialized genome database on the small plant Arabidopsis thaliana is expected to be available this spring to assist molecular biologists, geneticists, and other researchers. Scientists at the Massachusetts General Hospital (MGH) in Boston developed the database, known as AAtDB (An Arabidopsis thaliana Database).
The database is one objective of the Multinational Coordinated Arabidopsis thaliana Genome Research Project. The project was initiated primarily to encourage a coordinated research effort to use Arabidopsis as a model system for studying the biology of flowering plants.
The Arabidopsis genome project has reached the point where the physical map is taking form. Because the genome of Arabidopsis is small, about 1/30th the size of the human genome, the project is advancing rapidly. Thus, the need for a database is acute.
Specialized Database Needed
Why the need for a specialized database? Existing databases are simply not adequate to manage genomic information in a useful way. Molecular biologists and geneticists routinely use computers and databases to organize and interpret complex biological information; however, these databases are typically intended to perform a single set of related tasks and present the results of a query on the computer screen as text.
For example, one database may retrieve and analyze DNA sequences, while another offers bibliographic information, and still another tracks down the availability of mutant strains.
This segregation by data type may seem to be an efficient way to manage information. However, the specific focus of the "divide and conquer" strategy is not without cost--the user who needs to follow an information trail through several databases will find the experience awkward and frustrating.
Data Integration
The broad scope of the genome initiative creates a demand for a new kind of database--one that accommodates many different types of information. The various genome projects aim to correlate genetic information with the underlying physical structure of the chromosome. In effect, this means blending information--such as genetic maps, collections of cloned DNA segments, and DNA/protein sequence databases--that, until recently, was maintained separately.
As thousands of overlapping cloned DNA segments are matched with each other and aligned into large, continuous "physical maps," integrating the data in a practical way becomes necessary, both for bookkeeping and for convenient access. Managing a large-scale combined cloning and DNA sequencing effort requires being able to see at a glance the relationships among cloned DNA segments, the genetic map, and the DNA sequence database, so that the overall status of the project can be monitored. Ultimately, the recipients of the information--the greater scientific community--need the same tools if they are to use the information effectively.
AAtDB Design
AAtDB is designed along the broadly inclusive principles described earlier. The database allows a researcher to browse the enormous variety of information associated with the Arabidopsis genome and zero in on specific details. The process is simplified by the use of graphics. Genetic maps, physical maps, and the features of DNA sequences are drawn as pictures in the way that molecular biologists are accustomed to seeing them, rather than presented as columns of text and numbers.
The Arabidopsis database goes well beyond the "single purpose" biological databases most scientists are accustomed to using today because of two important features.
First, users obtain information from the database by using a mouse to "click on" objects in windows, in much the same way the Apple Macintosh computer allows the user to retrieve information. Information in AAtDB is always available in windows, whether it is simple text (for example, a bibliographic reference) or a pictorial representation of something more abstract (such as a genetic or physical map). In either case, users find out more information or move to completely new categories of information by "clicking on" hot spots, which can be either words or symbols.
Second, information in the database is linked together by a large number of interconnections. There is no single starting point for asking a question, nor are users required to move through the information in a single direction along a single path. Consequently, specific information in the database can be found in a variety of ways.
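The idea of many interconnections with no fixed starting point can be pictured as a graph of cross-referenced records. The following is only a minimal sketch in Python, not a description of how AAtDB stores its data: the record names (locus_A, clone_1, seq_X, ref_1) and the functions are hypothetical, invented for illustration.

```python
from collections import deque

# Hypothetical records and links; none of these names are real AAtDB entries.
LINKS = {
    "Locus:locus_A": ["Clone:clone_1", "Paper:ref_1"],
    "Clone:clone_1": ["Sequence:seq_X"],
    "Sequence:seq_X": ["Paper:ref_1"],
    "Paper:ref_1": [],
}


def make_bidirectional(links):
    """Return a graph in which every link can be followed in both directions."""
    graph = {name: set(targets) for name, targets in links.items()}
    for source, targets in links.items():
        for target in targets:
            graph.setdefault(target, set()).add(source)
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the shortest chain of links from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of links connects the two records


graph = make_bidirectional(LINKS)
# The same clone can be reached starting from a paper or from a sequence.
print(find_path(graph, "Paper:ref_1", "Clone:clone_1"))
print(find_path(graph, "Sequence:seq_X", "Clone:clone_1"))
```

Because every link can be followed in either direction, any record can serve as the entry point, which is the property the paragraph above describes.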
AAtDB also features a query language that can be used to directly search for keywords, phrases, or values in specific fields.
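A search on specific fields might look something like the sketch below. It is written in Python purely for illustration and does not reproduce the AAtDB query language, whose syntax is not described here; the records, field names, and the search function are all hypothetical.

```python
# Hypothetical records; titles, fields, and values are invented for illustration.
RECORDS = [
    {"class": "Paper", "title": "Flowering time mutants", "year": 1991},
    {"class": "Paper", "title": "RFLP linkage map", "year": 1988},
    {"class": "Strain", "title": "Late flowering line", "year": 1990},
]


def search(records, field, keyword=None, min_value=None):
    """Return records whose `field` contains `keyword` and/or is >= `min_value`."""
    hits = []
    for record in records:
        value = record.get(field)
        if value is None:
            continue  # the record has no such field
        if keyword is not None and keyword.lower() not in str(value).lower():
            continue  # keyword not found (case-insensitive match)
        if min_value is not None and value < min_value:
            continue  # value below the requested minimum
        hits.append(record)
    return hits


# Keyword search in one field, then a value search in another.
for record in search(RECORDS, "title", keyword="flowering"):
    print(record["class"], record["title"])
for record in search(RECORDS, "year", min_value=1990):
    print(record["class"], record["year"])
```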
Central Feature
The central feature of AAtDB is the integrated presentation of the genetic map, the physical map, and the DNA sequences that have been determined for Arabidopsis. The genetic map consists of over 500 markers on 5 chromosomes; the physical map currently contains nearly 15,000 cloned DNA segments. The DNA sequence collection contains over 300 entries from the GenBank DNA database, including their annotation. For each DNA sequence, the results of "similarity searches" are also available (obtained by comparing all possible amino acid translation products for each DNA sequence against several protein sequence databases).
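The phrase "all possible amino acid translation products" is usually taken to mean translation in all six reading frames: three on the given strand and three on its reverse complement. That reading is an assumption, as is the use of the standard genetic code; the sketch below illustrates the idea in Python and is not the procedure actually used to prepare the AAtDB search results. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases (for example "N") are reported as "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the translations of all six reading frames, keyed +1..+3, -1..-3."""
    dna = dna.upper()
    opposite = reverse_complement(dna)
    frames = {f"+{f + 1}": translate(dna, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(opposite, f) for f in range(3)})
    return frames


for name, protein in six_frame_translations("ATGGCC").items():
    print(name, protein)
```

Each of the six protein strings would then be compared against the protein sequence databases by a separate similarity-search program, which is outside the scope of this sketch.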
In addition to the genetic and physical maps, AAtDB contains bibliographic references for journal articles, books, theses, and symposia, which are organized by author, journal, and accession number. Many references have been provided by the National Agricultural Library (NAL) from the AGRICOLA bibliographic database as well as from other sources. Information for strains from the Nottingham Stock Centre Seed Catalog has been provided by the Stock Center at Nottingham, England. Also included are the contents of "The Greenbook" for Arabidopsis by Meyerowitz and Pruitt (with gene, allele, and bibliographic entries cross-referenced to the rest of the database). In addition, the database includes contact information for over 350 researchers and segregation data for many RFLP markers.
Plans are to add more information, including scanned images that document both the characteristics of mutant strains and the hybridization patterns produced by RFLP probes; an expanded list of keywords; raw data from genetic crosses; information on characteristics of mutant alleles; all seed, stock, and clone information from the new Ohio State University/Michigan State University Arabidopsis Biological Resource Center; and other information pertinent to the Arabidopsis community.
Database Software
The software for the database came from Dr. Richard Durbin (MRC, Cambridge, UK) and Dr. Jean Thierry-Mieg (CNRS-CRBM, Montpellier, France). Last year they released a database to accommodate the rapidly accumulating information generated by the C. elegans genome project. An important feature of the C. elegans database, called ACeDB (A Caenorhabditis elegans Database), is that it is easily adapted to meet the informatic requirements of a wide variety of organisms. This versatility makes it relatively easy to reconfigure ACeDB to manage information for Arabidopsis thaliana.
Obtaining AAtDB
The database currently runs as a stand-alone system on Sun Microsystems workstations as an X-Windows application. An Apple Macintosh version is planned. Currently, Macintosh and Microsoft Windows users, or individuals with access to X-Windows server software, can use AAtDB via a network connection to a UNIX computer that runs the database software.
AAtDB is a stand-alone, distributed database rather than a centralized one. To obtain AAtDB, users will copy the database software and data from an archive on the Internet worldwide computer network. Once the local copy of AAtDB is installed, users can run the software on their workstations and have access to all the collective Arabidopsis information. Initially, the database will be distributed by anonymous FTP over the Internet from sites in the United States, including NAL and MGH.
Distribution sites will also be established overseas. Electronic mail and the same FTP mechanism will be used to distribute updates to the software and data. Eventually, a CD-ROM version will be available from NAL.
NAL is providing funds to support the database development for Arabidopsis thaliana and four additional plant species: wheat, pine, soybean, and maize. Eventually the information from the five databases will be fed into a main database at NAL.
Contact
Readers who desire additional information about AAtDB can contact Dr. Sam Cartinhour or Dr. J. Michael Cherry in Professor Howard Goodman's laboratory by Internet electronic mail at curator@frodo.mgh.harvard.edu, or by fax at (617) 726-6893.