Published in Probe Volume 6 (Final): July 1996
Douglas W. Bigwood
Manager, Genome Informatics Group
Information Systems Division
USDA, ARS, National Agricultural Library
Beltsville, MD
Larger images can be viewed at http://probe.nal.usda.gov:8000/otherdocs/ww/Vol2/bigwood.html
One of the difficulties in searching any data resource is the lack of a facility to find items that are close to, but not exactly like, the search term(s) of interest. Wildcarding (for example, the use of asterisks in some search software) helps to a certain degree, but the results returned are often not exactly what is desired. A fuzzy searching facility, using a program called agrep, is provided on the Agricultural Genome World Wide Web Server at the National Agricultural Library. A brief description of how to use this facility, along with some hints and examples, follows.
Figure 1. The agrep query form
To connect with the agrep search form, open the following URL in your World Wide Web browser: http://probe.nal.usda.gov:8300/agrepquery.html
You will be presented with a form similar to that shown in figure 1. The form allows you to select three search parameters: case-sensitivity (default is case-insensitive), whether to search for the pattern as a "word" (as opposed to the pattern occurring as a part of a word; default is to search for the pattern as a word), and whether or not to search for the pattern as a superset (for example: "the only" matches "the one and only"; default is do not search for the pattern as a superset). You may also select zero, one, or two mismatches. A mismatch may be an insertion (for example: "Lansberg" will match "Landsberg"), a deletion ("adh-1" will match "adh1"), or a substitution ("Smith" will match "Smyth"). A cautionary note: allowing two mismatches can result in a slew of unexpected results, particularly if a single short search term is used. For example, searching for the pattern "adh" with two mismatches will bring back any object containing an a, d, or h.
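The mismatch rules above can be illustrated with a short Python sketch. It is not how agrep itself is implemented; it is a plain dynamic-programming approximation that counts the fewest insertions, deletions, or substitutions needed to make the pattern match somewhere inside a piece of text. The function names `min_mismatches` and `fuzzy_match` are hypothetical, and the sketch matches the pattern as part of a word rather than using the form's default "word" mode.

```python
def min_mismatches(pattern: str, text: str) -> int:
    """Fewest insertions, deletions, or substitutions needed to turn
    `pattern` into some substring of `text` (illustrative only)."""
    # prev[j]: best cost for the pattern prefix handled so far, matched
    # against a substring of `text` that ends at position j.
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere for free
    for i, p in enumerate(pattern, start=1):
        curr = [i]  # i pattern characters against no text: drop all i
        for j, t in enumerate(text, start=1):
            curr.append(min(
                prev[j] + 1,              # drop a pattern character
                curr[j - 1] + 1,          # absorb an extra text character
                prev[j - 1] + (p != t),   # exact match or substitution
            ))
        prev = curr
    return min(prev)


def fuzzy_match(pattern: str, text: str, allowed: int = 1,
                ignore_case: bool = True) -> bool:
    """True if `pattern` occurs in `text` with at most `allowed` mismatches."""
    if ignore_case:  # mirrors the form's case-insensitive default
        pattern, text = pattern.lower(), text.lower()
    return min_mismatches(pattern, text) <= allowed


if __name__ == "__main__":
    for pattern, text in [("Lansberg", "Landsberg"),
                          ("adh-1", "adh1"),
                          ("Smith", "Smyth")]:
        print(pattern, text, fuzzy_match(pattern, text, allowed=1))
    # A three-letter pattern with two mismatches needs only one shared letter.
    print(fuzzy_match("adh", "h", allowed=2))
```

Under these assumptions, each of the three example pairs matches with one mismatch allowed and fails with zero, while "adh" with two mismatches matches any text that shares even a single letter with it, which is why short patterns return so many unexpected objects.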
Figure 2. The agrep query results document
One simple but common use for agrep is to account for differences between the British and American spellings of words such as center/centre and color/colour. Other common uses include searching for a person's name that you do not know how to spell and adjusting to slight differences in gene nomenclature among taxa.
Search terms can be combined to perform boolean searches using a terse notation: "this,that" translates to "this or that"; "here;there" translates to "here and there." The wildcard character is "#", and it can be used anywhere in the search term.
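As a rough illustration of this notation, the Python sketch below turns a query string into a test that can be applied to a piece of text. It assumes, as in standard agrep, that a comma means "or", a semicolon means "and", and "#" stands for any run of characters (including none). The function name `compile_query` is hypothetical, the search is exact rather than fuzzy, and the web form may handle details differently.

```python
import re


def compile_query(query: str):
    """Build a predicate from a query such as 'this,that' or 'here;there'.

    Assumptions of this sketch: ',' means OR, ';' means AND, the two are
    not mixed in one query, and '#' matches any run of characters.
    """
    if "," in query and ";" in query:
        raise ValueError("this sketch does not mix ',' and ';' in one query")
    combine, separator = (all, ";") if ";" in query else (any, ",")
    terms = [term.strip() for term in query.split(separator) if term.strip()]
    # Escape the literal pieces of each term, then join them with '.*'
    # so that every '#' becomes "zero or more arbitrary characters".
    regexes = [
        re.compile(".*".join(re.escape(piece) for piece in term.split("#")),
                   re.IGNORECASE)
        for term in terms
    ]

    def matches(text: str) -> bool:
        return combine(rx.search(text) is not None for rx in regexes)

    return matches


if __name__ == "__main__":
    either = compile_query("this,that")
    both = compile_query("here;there")
    spelling = compile_query("col#r")
    print(either("only that one"), both("here only"), spelling("colour"))
```

A wildcard query such as "col#r" covers both "color" and "colour" without resorting to mismatches, which can be a more predictable choice when the variation is known in advance.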
The agrep form also provides a suite of plant and other databases to search. These can be searched together or selected à la carte. Because there is a limit of 1,000 objects, this is a useful way of limiting your search. In addition, because agrep searching involves paging through large text files, you can save time by selecting only the databases of interest.
The results document (see fig. 2) lists objects that match the search terms. At the top of figure 2, the search string is displayed with the databases that were searched. The names of the objects that contain the search string are then listed, grouped by the object's class (for example: locus or paper) and database. Clicking on an object's name will bring up the full data object (see figure 3).
Comments concerning the agrep search facility, or any other aspect of the Agricultural Genome World Wide Web server, should be e-mailed to: feedback@probe.nalusda.gov.
Reproduced with the full permission of Weeds World: International Electronic Arabidopsis Newsletter. http://probe.nal.usda.gov:8000/otherdocs/ww/Vol2/bigwood.html (17 June 1996)
For more information on this or other search tools, please refer to the article, "Simple Search Tools Can Save Time," Probe, Volume 4, Number 3/4, pages 19-20.