FindIT Help: recognize.php man file

written by: Patrick Leary
Filesize: 13576
Last modified: April 21, 2009, 7:01 pm

This application parses free text and identifies scientific name and author combinations. It can parse text files, HTML- or XML-encoded files, and PDF and Word documents. The algorithm was inspired by the work of Drew Koning, developer of the TaxonGrab application, which was developed at the American Museum of Natural History under an NSF-funded Taxonomic Literature Project; it does not, however, draw upon any of the source code or lexicons from that excellent resource. FindIT extends the basic name recognition process to recognize author citations, taxonomic rank, and nomenclatural annotation that may occur within a scientific name string. The SOAP methods incorporate a name parsing algorithm for structured XML output.

This application performs two iterations through an input file. PDF text is extracted using the Glyph and Cog pdftotext shell script. XML and HTML entities are converted to their text representation.
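The entity conversion step can be illustrated with a minimal Python sketch. This is not the FindIT source: the function name `to_plain_text` is hypothetical, and the tag stripping shown here is an assumption about how markup might be reduced to text before name recognition. Only the entity decoding is stated in the description above.

```python
import html
import re


def to_plain_text(markup: str) -> str:
    """Reduce HTML/XML markup to plain text (illustrative only)."""
    # Assumption: tags are replaced by spaces so adjacent words stay separated.
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    # Convert named and numeric entities to their text representation.
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", decoded).strip()


sample = "<p><i>Quercus</i> &amp; <i>Fagus</i> in M&uuml;ller&#39;s key</p>"
print(to_plain_text(sample))
```

Decoding entities before pattern matching matters because author names often contain accented characters that would otherwise appear as `&uuml;`-style sequences and break word boundaries.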

The first iteration attempts to discriminate possible name and author combinations from other text. The second iteration then evaluates the results of the first and assigns a confidence score. Iteration 1 uses pattern-matching expressions and a lexicon of English words to identify possible scientific names and author information. For example, a non-English capitalized word followed by a non-English lower-case word could indicate a genus and species combination. Approximately 3,000 English words have been flagged as also occurring within scientific nomenclature, based on cross-referencing with the approximately 4.7 million scientific names [12/1/2005] in our NameBank collection.

The result of the first iteration is an array of text strings that represent potential scientific name and author combinations.
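The first iteration can be sketched roughly as follows. This is a simplified illustration, not the actual FindIT expressions: the tiny `ENGLISH_WORDS` set stands in for the real English lexicon, the regular expression covers only the genus and species example given above, and author strings are ignored. All names here are hypothetical.

```python
import re

# Hypothetical stand-in for the English word lexicon.
ENGLISH_WORDS = {
    "the", "of", "in", "and", "were", "near", "river",
    "specimens", "collected",
}

# A capitalized word followed by a lower-case word.
PAIR = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]+)\b")


def candidate_names(text):
    """Return word pairs in which neither word is in the English lexicon."""
    candidates = []
    for match in PAIR.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        if genus.lower() not in ENGLISH_WORDS and epithet not in ENGLISH_WORDS:
            candidates.append(f"{genus} {epithet}")
    return candidates


text = "Specimens of Homo sapiens and Quercus alba were collected near the river."
print(candidate_names(text))
```

The output is a list of candidate strings, mirroring the array of potential names that is handed to the second iteration.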

The second iteration parses this array and assigns a confidence score to each resulting name. The parsing and scoring draw on some of the following elements:

Genera lexicon: Approximately 330,000 distinct biological genera.
Epithet lexicon: Approximately 250,000 distinct species and infra-species names.
Supra-generic lexicon: Approximately 27,000 suprageneric names.
Prefix lexicon: Known lower-case nomenclatural and author strings ("van", "de", "de la", "von", "sp. nov.") that may be misinterpreted as species names.
Genus and species suffixes: An analysis of genus and species suffix combinations is used to assess confidence in unknown potential genus and species strings.
Ambiguous lexicon: Strings that are both ordinary text words and scientific names. Examples include Attila, Biota, Lagos, Torpedo, Virginia, and aborigine.
Non lexicon: Words and phrases that made it through the recognition method and are explicitly declared not to be scientific names.
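One way the lexicons above might combine into a confidence score is sketched below. The actual FindIT scoring rules are not documented here, so every weight, the suffix list, and all names (`score`, `GENERA`, and so on) are assumptions chosen purely for illustration. The small sets stand in for the full lexicons.

```python
# Hypothetical miniature lexicons; the real ones hold hundreds of thousands of entries.
GENERA = {"Homo", "Quercus", "Attila"}
EPITHETS = {"sapiens", "alba"}
PREFIXES = {"van", "de", "von"}
AMBIGUOUS = {"Attila", "Biota", "Lagos", "Torpedo", "Virginia"}
NON_NAMES = {"Table of"}
EPITHET_SUFFIXES = ("us", "um", "is", "ensis", "ii")  # illustrative only


def score(candidate):
    """Return an integer confidence from 0 to 100 (weights are invented)."""
    if candidate in NON_NAMES:
        return 0
    parts = candidate.split()
    if len(parts) < 2:
        return 0
    genus, epithet = parts[0], parts[1]
    if epithet in PREFIXES:
        return 0  # e.g. an author particle mistaken for a species name
    points = 0
    if genus in GENERA:
        points += 50
    if epithet in EPITHETS:
        points += 40
    elif epithet.endswith(EPITHET_SUFFIXES):
        points += 20  # unknown epithet with a plausible Latin suffix
    if genus in AMBIGUOUS:
        points -= 20  # the word is also an ordinary text word
    return max(points, 0)


for name in ["Homo sapiens", "Attila spadiceus", "Ludwig van", "Table of"]:
    print(name, score(name))
```

Integer weights keep the arithmetic exact. A real implementation would presumably tune such weights against known names rather than fix them by hand.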

uBio copyright © 2021 by The Marine Biological Laboratory