written by: Patrick Leary
Last modified: April 21, 2009, 7:01 pm
This application parses free text and identifies scientific name and author combinations. It can parse text files, HTML- or XML-encoded files, PDF and Word documents. The algorithm was inspired by the work of Drew Koning, developer of the TaxonGrab application at the American Museum of Natural History under an NSF-funded Taxonomic Literature Project, although it does not draw upon any of the source code or lexicons from this excellent resource. FindIT extends the basic name recognition process to recognize author citations, taxonomic rank and nomenclatural annotation that may occur within a scientific name string. The SOAP methods incorporate a name parsing algorithm for structured XML output.
This application performs two iterations through an input file. PDF text is extracted using the Glyph and Cog pdftotext shell script. XML and HTML entities are converted to their text representation.
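The entity conversion step can be sketched in Python as follows. This is only an illustration using the standard library, not the application's own code; the function name to_plain_text and the simple tag-stripping regular expression are assumptions made for the example.

```python
import html
import re

def to_plain_text(markup: str) -> str:
    """Illustrative only: strip tags and decode XML/HTML entities."""
    # Replace anything that looks like a tag with a space (naive, assumed rule).
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    # Convert entities such as &amp; or &uuml; to their text representation.
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", decoded).strip()

print(to_plain_text("<i>Quercus</i> &amp; <i>Fagus</i> M&uuml;ller"))
```

Decoding entities before matching matters because author names often contain accented characters that would otherwise break the pattern matching.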
The first iteration attempts to discriminate possible name and author combinations from text that contains neither. The second iteration then evaluates the results of the first and provides a confidence score. Iteration 1 uses pattern-matching expressions and a lexicon of English words to identify possible scientific names and author information. A non-English capitalized word followed by a non-English lower-case word, for example, could indicate a genus and species combination. Approximately 3,000 English words have been flagged as co-occurring within scientific nomenclature, based on cross-referencing with the approximately 4.7 million scientific names [12/1/2005] in our NameBank collection.
The result of the first iteration is an array of text strings that represent potential scientific name and author combinations.
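A minimal sketch of the first-iteration heuristic, assuming a tiny stand-in lexicon, might look like this in Python. The names ENGLISH_WORDS and find_candidates are hypothetical, and the real application uses a much larger lexicon and richer patterns that also capture author strings.

```python
import re
from typing import List

# Hypothetical stand-in for the English lexicon; the real one is far larger.
ENGLISH_WORDS = {"the", "oak", "was", "found", "near", "river", "a", "of", "in"}

WORD = re.compile(r"[A-Za-z]+")

def find_candidates(text: str) -> List[str]:
    """Return word pairs that look like genus + species combinations."""
    words = WORD.findall(text)
    candidates = []
    for first, second in zip(words, words[1:]):
        # Genus: capitalized word that is not in the English lexicon.
        looks_like_genus = (
            first[0].isupper()
            and first[1:].islower()
            and first.lower() not in ENGLISH_WORDS
        )
        # Species: all lower-case word that is not in the English lexicon.
        looks_like_species = second.islower() and second not in ENGLISH_WORDS
        if looks_like_genus and looks_like_species:
            candidates.append(f"{first} {second}")
    return candidates

print(find_candidates("The oak Quercus alba was found near the river."))
```

The output is a list of candidate strings, mirroring the array of potential names that the first iteration hands to the second.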
The second iteration parses this array and assigns a confidence score to each resulting name. The parsing and scoring utilize some of the following elements:
Approximately 330,000 distinct biological genera.
Approximately 250,000 distinct species and infra-species names.
Approximately 27,000 suprageneric names.
Known nomenclatural and author lower-case names ("van," "de," "de la," "von," "sp. nov.") that may be misinterpreted as species names.
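How these elements could feed a confidence score is sketched below in Python. The source does not describe the actual scoring formula, so the weights (0.5 per match), the tiny example sets and the function name score_candidate are all assumptions made purely for illustration.

```python
# Hypothetical miniature lookup sets; the real lists hold hundreds of
# thousands of entries.
KNOWN_GENERA = {"Quercus", "Homo"}
KNOWN_SPECIES = {"alba", "sapiens"}
# Lower-case author particles and annotations that are not species names.
NOT_SPECIES = {"van", "de", "von", "sp."}

def score_candidate(candidate: str) -> float:
    """Assumed scoring rule: 0.5 for a known genus, 0.5 for a known species."""
    tokens = candidate.split()
    if len(tokens) < 2:
        return 0.0
    genus, species = tokens[0], tokens[1]
    if species in NOT_SPECIES:
        # The second token is a particle or annotation, so only the
        # genus can contribute to the score.
        return 0.5 if genus in KNOWN_GENERA else 0.0
    score = 0.0
    if genus in KNOWN_GENERA:
        score += 0.5
    if species in KNOWN_SPECIES:
        score += 0.5
    return score

for name in ["Quercus alba", "Quercus sp. nov.", "Linnaeus von"]:
    print(name, score_candidate(name))
```

Checking the particle list first keeps a string such as "Linnaeus von" from being scored as though "von" were a species name.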