recognize.php
written by: Patrick Leary
Filesize: 13537
Last modified: May 18, 2007, 11:28 am
This application parses free text and identifies scientific name and author combinations. It can parse text files, HTML- or XML-encoded files, and PDF and Word documents. The algorithm was inspired by the work of Drew Koning, developer of the TaxonGrab application at the American Museum of Natural History under an NSF-funded Taxonomic Literature Project, although it does not draw upon any of the source code or lexicons from this excellent resource. FindIT extends the basic name recognition process to recognize author citations, taxonomic rank, and nomenclatural annotation that may occur within a scientific name string. The SOAP methods incorporate a name parsing algorithm for structured XML output.
This application performs two iterations through an input file. PDF text is extracted using the Glyph & Cog pdftotext command-line utility, invoked from the shell. XML and HTML entities are converted to their text representation.
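The preprocessing described above can be sketched as follows. This is a minimal illustration in Python, not the code of recognize.php. The function names extract_pdf_text and decode_entities are hypothetical, and the sketch assumes that the pdftotext binary is on the PATH and that passing "-" as the output file sends the extracted text to standard output.

```python
import html
import subprocess


def extract_pdf_text(pdf_path):
    """Return the text of a PDF by shelling out to pdftotext.

    Assumption: pdftotext is installed; "-" as the output file
    makes it write the extracted text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def decode_entities(text):
    """Convert named and numeric character entities to plain characters."""
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("Quercus &amp; Fagus &#233;tude"))
```

Only decode_entities runs in the usage line; extract_pdf_text needs a real PDF file and the external tool to be present.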
The first iteration attempts to discriminate possible name and author combinations from other text. The second iteration then evaluates the results of the first and assigns a confidence score. Iteration 1 uses pattern-matching expressions and a lexicon of English words to identify possible scientific names and author information. A non-English capitalized word followed by a non-English lower-case word, for example, could indicate a genus and species combination. Approximately 3,000 English words have been flagged as also occurring within scientific nomenclature, based on cross-referencing with the approximately 4.7 million scientific names [12/1/2005] in our NameBank collection.
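The genus and species heuristic from the first iteration might look roughly like the sketch below. It is an illustration only: the tiny ENGLISH_WORDS set stands in for the real English lexicon, the single regular expression stands in for the application's pattern-matching expressions, and the name candidate_names is hypothetical. Author detection and the roughly 3,000 flagged English words are left out.

```python
import re

# Hypothetical stand-in for the English-word lexicon.
ENGLISH_WORDS = {
    "the", "was", "found", "in", "and", "near",
    "oak", "forest", "this", "species",
}

# A capitalized word followed by a lower-case word.
PAIR = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]+)\b")


def candidate_names(text, english=ENGLISH_WORDS):
    """Return capitalized/lower-case word pairs where neither word is English."""
    candidates = []
    for match in PAIR.finditer(text):
        first, second = match.group(1), match.group(2)
        if first.lower() not in english and second not in english:
            candidates.append(f"{first} {second}")
    return candidates


if __name__ == "__main__":
    sample = "The oak Quercus robur was found near Fagus sylvatica in this forest."
    print(candidate_names(sample))  # ['Quercus robur', 'Fagus sylvatica']
```

Here "The oak" matches the pattern but is discarded because both words are in the English lexicon, while the two binomials pass through as candidates for the second iteration.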
The result of the first iteration is an array of text strings that represent potential scientific name and author combinations.
The second iteration parses this array and assigns a confidence score to each resulting name. The parsing and scoring use some of the following elements:
Genera lexicon
Approximately 330,000 distinct biological genera.
Epithet lexicon
Approximately 250,000 distinct species and infra-species names.
Supra-generic lexicon
Approximately 27,000 supra-generic names.
Prefix lexicon
Known lower-case nomenclatural terms and author-name particles ("van," "de," "de la," "von," "sp. nov.") that may be misinterpreted as species names.
Strings that are both text words and scientific names. Examples include Attila, Biota, Lagos, Torpedo, Virginia, aborigine, etc.
Non lexicon
Words and phrases that made it through the recognition method and are explicitly declared as not scientific names.
Results are scored based on the presence of known names within the name combination, or on whether an unknown name falls within the probability range of known Latin name suffixes.
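A much-simplified sketch of lexicon-based scoring is shown below. Everything in it is an assumption made for illustration: the lexicon contents, the weights 0.5 and 0.25, the plain suffix test used in place of a suffix probability range, and the function name score. It handles only two-word candidates and ignores authors, ranks, and the prefix lexicon.

```python
# Hypothetical miniature lexicons; the real ones hold hundreds of thousands of names.
GENERA = {"Quercus", "Fagus"}
EPITHETS = {"robur", "sylvatica"}
NON_NAMES = {"Vide supra"}  # strings explicitly declared not to be names

# Assumed list of common Latin endings; a crude stand-in for suffix probabilities.
LATIN_SUFFIXES = ("us", "a", "um", "is", "ii", "ae", "ensis")


def score(candidate):
    """Return a confidence between 0.0 and 1.0 for a 'Genus epithet' string."""
    if candidate in NON_NAMES:
        return 0.0
    genus, _, epithet = candidate.partition(" ")
    total = 0.0
    for word, lexicon in ((genus, GENERA), (epithet, EPITHETS)):
        if word in lexicon:
            total += 0.5    # known name
        elif word.lower().endswith(LATIN_SUFFIXES):
            total += 0.25   # unknown, but with a Latin-looking ending
    return total


if __name__ == "__main__":
    for name in ("Quercus robur", "Quercus novus", "Vide supra", "Xyzzy plugh"):
        print(name, score(name))
```

With these assumed weights, a fully known combination scores 1.0, a known genus with an unknown Latin-looking epithet scores 0.75, and a string in the non lexicon scores 0.0 regardless of its endings.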
An administrative mode allows results to be used to refine the recognition algorithm, either by modifying the rule sets or by adding returned strings to one of the lexicons.