Published in Probe Volume 4(3-4): August 1994-January 1995
Genome Informatics Group
Information Systems Division
USDA, ARS, National Agricultural Library
Beltsville, MD 20705
The Genome Informatics Group at the National Agricultural Library now offers eight plant databases, encompassing Arabidopsis, rice, maize, the Triticeae, the Solanaceae, Chlamydomonas, soybeans, and forest trees. While integrating these into one unified environment remains a formidable technical problem, two very simple tools--WAIS [1] and agrep [2]--make it possible to search one or more databases with a single query. Both are available via the ACEDB World Wide Web (WWW) interface at http://probe.nalusda.gov:8300. To use them, your web viewer will need to handle HTML forms correctly (the current release of NCSA Mosaic does so for all platforms).
WAIS and agrep queries are particularly useful if your search can be expressed as one or a few words or if you want to examine every record (or "object") in a database (the information is in there somewhere, but where?). Even if you are interested in only one database, the searches are convenient enough for routine use. In addition, agrep supports "fuzzy" searches, described in more detail below.
Searching with WAIS
WAIS is an indexing and searching system that is familiar to many Gopher users. It can be applied to WWW-based information as well to allow rapid, simple queries. We have used WAIS to index nearly every word in the databases available to us (some words are suppressed because they occur too frequently to be useful). A WAIS search is initiated by opening the WAIS form (the WWW page listing all the databases contains a link to it). Use the check boxes to select the databases you wish to search. By default the search is limited to the plant databases. If your query is successful, you will obtain a list of objects that you can click on in the usual way to obtain additional information. If you have searched more than one database at the same time, the list may contain objects from different databases.
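The indexing idea described above can be illustrated with a small inverted index. This is only a sketch under stated assumptions, not the WAIS indexer itself: the toy databases, the object names, and the `max_fraction` cutoff used to suppress overly frequent words are all hypothetical.

```python
import re
from collections import defaultdict

def build_index(databases, max_fraction=0.5):
    """Map each lowercased word to the set of (database, object) pairs containing it.

    Words found in more than `max_fraction` of all objects are suppressed,
    mimicking the idea that very frequent words are not useful to index.
    The cutoff rule is an assumption made for this sketch.
    """
    postings = defaultdict(set)
    total_objects = 0
    for db_name, objects in databases.items():
        for obj_name, text in objects.items():
            total_objects += 1
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                postings[word].add((db_name, obj_name))
    limit = max_fraction * total_objects
    return {word: refs for word, refs in postings.items() if len(refs) <= limit}

# Hypothetical toy data standing in for two of the plant databases.
databases = {
    "maize": {
        "Gene:adh1": "alcohol dehydrogenase gene of maize",
        "Gene:lhcb1": "light harvesting protein gene",
    },
    "rice": {
        "Gene:phyA": "phytochrome gene of rice",
    },
}

index = build_index(databases)
```

Because the index spans every database at once, a single lookup such as `index["rice"]` can return objects from any of them, which is what makes the multi-database query cheap.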
WAIS syntax is easy to master. By default WAIS looks for complete, exact (but case-insensitive) matches. For example, if you search for photo, only objects that contain the word photo will be returned. An object containing the word photograph will not be detected. Partial word matching is provided by adding a wildcard, in this case an asterisk (*), to the end of a word (e.g., photo*). Thus, if your first search returns nothing, try the same word with an asterisk at the end. You can only use the asterisk to extend a word. For example, *hoto and ph*to will not match photo.
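The matching rules just described can be mimicked in a few lines. The function below is a hypothetical illustration of those rules (whole-word, case-insensitive matching, with a trailing asterisk extending the word); it is not part of WAIS, and the tokenization into letters and digits is an assumption.

```python
import re

def wais_term_matches(term, text):
    """Return True if `term` matches some word in `text`, ignoring case.

    A single trailing asterisk turns the term into a prefix match.
    An asterisk anywhere else is not treated as a wildcard, so such
    terms simply fail to match ordinary words.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    term = term.lower()
    if term.endswith("*") and "*" not in term[:-1]:
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words

print(wais_term_matches("photo", "a photograph of leaves"))   # exact word only
print(wais_term_matches("photo*", "a photograph of leaves"))  # prefix match
print(wais_term_matches("*hoto", "photo"))                    # leading wildcard unsupported
```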
If you enter a series of words (word1 word2 word3...), the response will be a list of objects that contain matches to at least one of the query words. In this case WAIS is treating the list as if the Boolean operator OR were present as a search modifier (word1 or word2 or word3). A search can also use the operators AND and NOT. Literal phrase matching is provided when the phrase is surrounded by either single or double quotes. For example, the query "light harvesting" will only return objects that contain that exact phrase.
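To make the difference between OR, AND, NOT, and phrase queries concrete, here is a minimal sketch that evaluates each kind of query over a handful of objects. The object names and texts are invented, and the helper functions are hypothetical; the sketch does not parse WAIS query strings or reproduce its ranking.

```python
import re

def words_of(text):
    """Split text into lowercased words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_word(text, word):
    return word.lower() in words_of(text)

def has_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively, in order, in `text`."""
    target = words_of(phrase)
    words = words_of(text)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def search(objects, predicate):
    """Return the sorted names of objects whose text satisfies `predicate`."""
    return sorted(name for name, text in objects.items() if predicate(text))

# Hypothetical objects for illustration.
objects = {
    "A": "light harvesting complex protein",
    "B": "harvesting of light-grown seedlings",
    "C": "chlorophyll binding protein",
}

either = search(objects, lambda t: has_word(t, "light") or has_word(t, "protein"))
both = search(objects, lambda t: has_word(t, "light") and has_word(t, "protein"))
without = search(objects, lambda t: has_word(t, "protein") and not has_word(t, "light"))
phrase = search(objects, lambda t: has_phrase(t, "light harvesting"))
```

Object B contains both words of the phrase but not adjacent and in order, so only the phrase query excludes it while the OR query still returns it.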
Searching with agrep
"Fuzzy" or approximate searches with agrep make it possible to identify objects which do not exactly match your query. This is particularly useful if you are uncertain how to spell an item exactly (was it adh-1 or adh1?). The form for the agrep query is almost identical to that for WAIS, allowing one or more databases to be selected for screening. The response to the query will be a list of objects from one or more databases, depending on which sources you selected.
Type your search pattern into the text box and click the search button to begin the search. By default the search assumes case doesn't matter and that the pattern is surrounded by whitespace. If you are searching more than one database, be patient. Searching every one of the databases (well over 200 MB) might take more than a minute. You can use wildcards, but note that agrep uses # instead of * for this purpose (the asterisk has another meaning to agrep). The Boolean operator AND is represented by a semicolon. Thus Jones;Smith will identify any objects containing BOTH of these patterns. OR is represented by a comma. Jones,Smith will identify objects containing one or both patterns.
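The operators above can be emulated roughly with regular expressions. This is a sketch of the behavior as described here, not of agrep's implementation: it assumes that # stands for any run of non-whitespace characters, that each pattern must be bounded by whitespace or the ends of the text, and that a single query does not mix semicolons and commas. The function names are hypothetical.

```python
import re

def compile_part(part):
    """Compile one pattern: '#' becomes a wildcard, everything else is literal."""
    body = r"\S*".join(re.escape(piece) for piece in part.split("#"))
    # Require whitespace (or the start/end of the text) on both sides.
    return re.compile(r"(?<!\S)" + body + r"(?!\S)", re.IGNORECASE)

def agrep_like_match(pattern, text):
    """Evaluate 'a;b' as AND and 'a,b' as OR over whitespace-delimited patterns."""
    if ";" in pattern and "," in pattern:
        raise ValueError("mixing ';' and ',' is not handled in this sketch")
    if ";" in pattern:
        parts = [p for p in pattern.split(";") if p]
        return all(compile_part(p).search(text) for p in parts)
    parts = [p for p in pattern.split(",") if p]
    return any(compile_part(p).search(text) for p in parts)

print(agrep_like_match("Jones;Smith", "Jones and Smith 1993"))  # both present
print(agrep_like_match("Jones,Smith", "Smith 1990"))            # one is enough
print(agrep_like_match("adh#", "the adh1 locus"))               # wildcard suffix
```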
An agrep search is fuzzy if you allow mismatching. Use the radio buttons on the agrep form to set the number of mismatches. For example, massechusets matches massachusetts with two errors (one substitution and one insertion). If you want certain parts of the pattern to match exactly, put that part inside angle brackets (<>). For example, <mathemat>ics matches mathematical with one error (replacing the last s with an a), but mathe
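The notion of counting mismatches can be illustrated with the classic edit (Levenshtein) distance, in which each substitution, insertion, or deletion counts as one error. This is a simplified sketch: it compares two whole words, whereas agrep looks for approximate occurrences inside records and uses a different, faster algorithm [2]. The function names are hypothetical, and the exact-match angle brackets are not modeled.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(pattern, word, max_errors):
    """True if `word` is within `max_errors` edits of `pattern`, ignoring case."""
    return edit_distance(pattern.lower(), word.lower()) <= max_errors

print(edit_distance("massechusets", "massachusetts"))  # 2
print(edit_distance("adh-1", "adh1"))                  # 1
```

With the number of allowed mismatches set to one, `fuzzy_match("adh-1", "adh1", 1)` succeeds, while the misspelled massechusets needs two.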
WAIS and agrep limitations
Although WAIS and agrep both support multi-database queries, it is important to understand that the way data is organized varies considerably across databases. For example, one database may store information about people and organizations in a category called "Person" while another may use "Colleague." Within the categories themselves there are often differences that can have a significant impact on the results that a query returns. Thus queries should not be expected to yield uniform results from one database to the next.
Even though WAIS and agrep queries can include wildcards (or, in the case of agrep, mismatches), neither tool offers a full complement of pattern matching options. However, a more flexible "regular expression" search tool may become available soon.
Finally, many queries are difficult or impossible to express as WAIS or agrep expressions. In these cases another query method may be more appropriate. "Query by example" and the "query builder" are both tools that are part of the ACEDB software. Each database listed on the WWW interface page has a link to these tools, which operate within the context of one database at a time.
Conclusions
NAL's genome databases can be queried using a spectrum of tools which range in complexity from WAIS and agrep through "query by example," the "query builder," and the ACEDB query language itself. While WAIS and agrep are simple tools, they offer the most efficient methods for screening entire databases for words or phrases.
Queries can be the Achilles' heel of any data retrieval system, and the perfect query interface has yet to be demonstrated for any software. It is therefore important to continue development in this area, especially to support users who are interested in more complex cross-species comparisons.
1. WAIS is copyright (c) MCNC, Clearinghouse for Networked Information Discovery and Retrieval, 1993.
2. Wu and Manber, "Fast Text Searching With Errors," Technical Report #91-11, Department of Computer Science, University of Arizona, June 1991.