Simple Search Tools Can Save Time

Published in Probe Volume 4(3-4): August 1994-January 1995


Genome Informatics Group
Information Systems Division
USDA, ARS, National Agricultural Library
Beltsville, MD 20705

The Genome Informatics Group at the National Agricultural Library now offers eight plant databases, covering Arabidopsis, rice, maize, the Triticeae, the Solanaceae, Chlamydomonas, soybeans, and forest trees. While integrating these into one unified environment remains a formidable technical problem, two very simple tools, WAIS [1] and agrep [2], make it possible to search one or more databases with a single query. Both are available via the ACEDB World Wide Web (WWW) interface at http://probe.nalusda.gov:8300. To use them, your web viewer must handle HTML forms correctly (the current release of NCSA Mosaic does so on all platforms).

WAIS and agrep queries are particularly useful if your search can be expressed as one or a few words or if you want to examine every record (or "object") in a database (the information is in there somewhere, but where?). Even if you are interested in only one database, the searches are convenient enough for routine use. In addition, agrep supports "fuzzy" searches, described in more detail below.

Searching with WAIS

WAIS is an indexing and searching system that is familiar to many Gopher users. It can also be applied to WWW-based information to allow rapid, simple queries. We have used WAIS to index nearly every word in the databases available to us (some words are suppressed because they occur too frequently to be useful). To start a WAIS search, open the WAIS form (the WWW page listing all the databases contains a link to it). Use the check boxes to select the databases you wish to search. By default the search is limited to the plant databases. If your query is successful, you will obtain a list of objects that you can click on in the usual way to obtain additional information. If you have searched more than one database at the same time, the list may contain objects from different databases.

WAIS syntax is easy to master. By default WAIS looks for complete, exact (but case-insensitive) matches. For example, if you search for photo, only objects that contain the word photo will be returned. An object containing the word photograph will not be detected. Partial word matching is provided by adding a wildcard, in this case an asterisk (*), to the end of a word (e.g., photo*). Thus, if your first search returns nothing, try the same word with an asterisk at the end. You can only use the asterisk to extend a word. For example, *hoto and ph*to will not match photo.
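The matching rule described above can be illustrated with a short Python sketch. It is not the WAIS implementation; the function name wais_word_match and the definition of a "word" as a run of letters and digits are assumptions made for illustration only.

```python
import re


def wais_word_match(query: str, text: str) -> bool:
    """Return True if `text` contains a word matching `query`.

    Illustrative rule only: whole-word, case-insensitive matching, where a
    trailing asterisk means "any word that starts with this prefix".
    """
    # Assumption: a word is any run of letters or digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    q = query.lower()
    if q.endswith("*"):
        prefix = q[:-1]
        return any(word.startswith(prefix) for word in words)
    # No trailing asterisk: the query must equal a complete word.
    return q in words


print(wais_word_match("photo", "a photograph of the leaf"))   # complete-word rule
print(wais_word_match("photo*", "a photograph of the leaf"))  # prefix rule
```

Because only a trailing asterisk is treated specially, queries such as *hoto or ph*to fall through to the complete-word comparison and find nothing, which mirrors the behavior described in the text.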

If you enter a series of words (word1 word2 word3...), the response will be a list of objects that contain matches to at least one of the query words. In this case WAIS treats the list as if the Boolean operator OR were present as a search modifier (word1 or word2 or word3). A search can also use the operators AND and NOT. Literal phrase matching is provided when the phrase is surrounded by either single or double quotes. For example, the query "light harvesting" will return only objects that contain that exact phrase.
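As a rough illustration of the default OR behavior and of quoted phrases, here is a minimal Python sketch. It does not reproduce WAIS itself: the function names, the word definition, and the use of shlex to keep quoted phrases together are all assumptions, and the AND and NOT operators are left out.

```python
import re
import shlex


def tokenize(text):
    """Split text into lowercase words (assumption: runs of letters/digits)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(words, wanted):
    """Return True if the list `wanted` occurs as a consecutive run in `words`."""
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


def or_query(query, text):
    """Return True if `text` matches at least one term of `query`.

    Unquoted words are separate terms joined by an implicit OR; a phrase in
    single or double quotes is one term whose words must appear in order.
    """
    words = tokenize(text)
    for term in shlex.split(query):  # shlex keeps quoted phrases together
        term_words = tokenize(term)
        if term_words and contains_sequence(words, term_words):
            return True
    return False


print(or_query("rice maize", "the maize genome"))
print(or_query('"light harvesting"', "harvesting of light"))
```

A single-word term is handled as a phrase of length one, so the same whole-word rule applies to both cases.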

Searching with agrep

"Fuzzy" or approximate searches with agrep make it possible to identify objects that do not exactly match your query. This is particularly useful if you are uncertain of the exact spelling of an item (was it adh-1 or adh1?). The form for the agrep query is almost identical to that for WAIS, allowing one or more databases to be selected for screening. The response to the query will be a list of objects from one or more databases, depending on which sources you selected.

Type your search pattern into the text box and click the search button to begin the search. By default the search assumes case doesn't matter and that the pattern is surrounded by whitespace. If you are searching more than one database, be patient. Searching every one of the databases (well over 200 MB) might take more than a minute. You can use wildcards, but note that agrep uses # instead of * for this purpose (the asterisk has another meaning to agrep). The Boolean operator AND is represented by a semicolon. Thus Jones;Smith will identify any objects containing BOTH of these patterns. OR is represented by a comma. Jones,Smith will identify objects containing one or both patterns.
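The pattern conventions just described (semicolon for AND, comma for OR, # as the wildcard) can be mimicked in a few lines of Python. This is a sketch of the conventions only, not of agrep's algorithm. The function name agrep_like_match is hypothetical, "surrounded by whitespace" is approximated as "not adjacent to a letter or digit," and the sketch assumes a pattern uses either semicolons or commas, not both.

```python
import re


def agrep_like_match(pattern: str, text: str) -> bool:
    """Return True if `text` satisfies an agrep-style pattern.

    ';' joins terms that must all be present (AND); ',' joins alternatives
    (OR). '#' inside a term stands for any run of word characters.
    Matching is case-insensitive and whole-word.
    """
    lowered = text.lower()

    def has(term: str) -> bool:
        # Escape the literal pieces and join them with a wildcard for each '#'.
        pieces = [re.escape(piece) for piece in term.lower().split("#")]
        regex = r"(?<!\w)" + r"\w*".join(pieces) + r"(?!\w)"
        return re.search(regex, lowered) is not None

    if ";" in pattern:
        return all(has(term) for term in pattern.split(";"))
    return any(has(term) for term in pattern.split(","))


print(agrep_like_match("Jones;Smith", "Jones and Smith, 1994"))
print(agrep_like_match("Jones,Smith", "Jones, 1994"))
print(agrep_like_match("photo#", "photosynthesis"))
```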

An agrep search is fuzzy if you allow mismatching. Use the radio buttons on the agrep form to set the number of mismatches. For example, massechusets matches massachusetts with two errors (one substitution and one insertion). If you want certain parts of the pattern to match exactly, put that part inside angle brackets (<>). For example, <mathemat>ics matches mathematical with one error (replacing the last s with an a), but mathe<matics> does not match mathematical no matter how many errors are allowed.
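To make the idea of counting mismatches concrete, the sketch below computes the classic edit distance (insertions, deletions, and substitutions) between two whole words in Python. It is a simplification and not agrep's algorithm: agrep looks for approximate matches within the text rather than comparing whole strings, so its error counts can differ from these, and the angle-bracket exact-match feature is not modeled. The function names are chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_word_match(pattern: str, word: str, max_errors: int) -> bool:
    """Case-insensitive whole-word comparison allowing up to `max_errors`."""
    return edit_distance(pattern.lower(), word.lower()) <= max_errors


print(edit_distance("massechusets", "massachusetts"))  # prints 2
```

With the mismatch setting at 1, fuzzy_word_match("massechusets", "massachusetts", 1) is False; raising it to 2 makes the match succeed, which corresponds to selecting a higher number of mismatches on the form.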

WAIS and agrep limitations

Although WAIS and agrep both support multi-database queries, it is important to understand that the way data is organized varies considerably across databases. For example, one database may store information about people and organizations in a category called "Person" while another may use "Colleague." Within the categories themselves there are often differences that can have a significant impact on the results that a query returns. Thus queries should not be expected to yield uniform results from one database to the next.

Even though WAIS and agrep queries can include wildcards (or, in the case of agrep, mismatches), neither tool offers a full complement of pattern matching options. However, a more flexible "regular expression" search tool may become available soon.

Finally, many queries are difficult or impossible to express as WAIS or agrep expressions. In these cases another query method may be more appropriate. "Query by example" and the "query builder" are both tools that are part of the ACEDB software. Each database listed on the WWW interface page has a link to these tools, which operate within the context of one database at a time.

Conclusions

NAL's genome databases can be queried using a spectrum of tools that range in complexity from WAIS and agrep through "query by example," the "query builder," and the ACEDB query language itself. Although WAIS and agrep are simple tools, they offer the most efficient methods for screening entire databases for words or phrases.

Queries can be the Achilles' heel of any data retrieval system, and the perfect query interface has yet to be demonstrated for any software. It is therefore important to continue development in this area, especially to support users who are interested in more complex cross-species comparisons.

1. WAIS is copyright (c) MCNC, Clearinghouse for Networked Information Discovery and Retrieval, 1993.

2. Wu and Manber, "Fast Text Searching With Errors," Technical Report #91-11, Department of Computer Science, University of Arizona, June 1991.