|The author column was exported into a raw two-column text file (6.8 MB; 342,991 lines of the form id|author).
We made multiple regular-expression passes over the file to strip out as much non-author text as possible and to split multi-author lines into individual records. The text was examined for common patterns that could be removed, with the goal of reducing the file to a regular form of ID \t singleName. Two sample input lines and one of the patterns used:
|Herrich-Schaeffer [1856 (Pls. 1853)]
|Henderson & Bartsch 1920
||(.*) *& *(.*) *\d+
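The cleaning passes described above can be sketched roughly as follows. This is a minimal illustration and not the original script: the two patterns `YEAR_TAIL` and `AUTHOR_SEP`, the function names, and the record IDs are all assumptions, and the real passes very likely handled many more cases than a trailing year and an ampersand.

```python
import re

# Hypothetical patterns (assumptions); the original passes are not listed in full.
# A four-digit year, optionally preceded by "[" or "(", and everything after it.
YEAR_TAIL = re.compile(r"\s*[\[(]?\d{4}.*$")
# The "&" separator between co-authors, with any surrounding whitespace.
AUTHOR_SEP = re.compile(r"\s*&\s*")


def split_authors(record_id, raw):
    """Return (id, single_name) pairs for one raw author value."""
    cleaned = YEAR_TAIL.sub("", raw).strip()
    return [(record_id, name) for name in AUTHOR_SEP.split(cleaned) if name]


def clean_lines(lines):
    """Turn 'id|author' lines into 'id<TAB>singleName' lines."""
    for line in lines:
        record_id, _, raw = line.rstrip("\n").partition("|")
        for rid, name in split_authors(record_id, raw):
            yield f"{rid}\t{name}"
```

With these assumed patterns, a line such as `7|Henderson & Bartsch 1920` would yield two records, one for Henderson and one for Bartsch, which is consistent with the line count growing after cleaning.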
The resultant file (5.9 MB; 373,340 lines) was loaded into a MySQL table, and the table at left was generated with:
select author, count(author) as cnt from authors group by author having count(author) > 100 order by cnt desc;
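For readers without a MySQL instance, the same aggregation can be tried with a small sketch like the one below. It swaps in Python's built-in SQLite module in place of MySQL, and the two-column schema (`id`, `author`) and the function name `frequent_authors` are assumptions; only the table name `authors`, the column `author`, and the query shape come from the text above.

```python
import sqlite3


def frequent_authors(rows, threshold=100):
    """Count records per author and keep authors with more than `threshold`."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per (id, singleName) record from the cleaned file.
    conn.execute("CREATE TABLE authors (id TEXT, author TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT author, COUNT(author) AS cnt FROM authors "
        "GROUP BY author HAVING COUNT(author) > ? ORDER BY cnt DESC",
        (threshold,),
    ).fetchall()
    conn.close()
    return result
```

The `threshold` parameter defaults to the cutoff of 100 used in the query above; lowering it is handy when testing against a small sample of the cleaned file.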