Shide Dehghani

UC Berkeley

ExtractFlora: a pipeline for transforming a floristic manual into a database for ecological and evolutionary study

Historical natural history documents are rich sources of data for modern big data analyses in ecology, evolution, and conservation. We are developing a novel text analysis pipeline that transforms a three-volume floristic manual (1966-1983) into a database of species-specific information, such as location data. The pipeline combines PyMuPDF (fitz), regular expressions, and centroid-based clustering to group the entries in the index, identify each species entry in the main text, and extract the associated data. To date, we have processed the index for all three volumes and extracted over 60% of the species entries. To capture more entries, we are exploring fuzzy matching and the use of other document features, such as line spacing, to identify entries. In the future, we plan to explore supervised learning models to refine the captured entries. Once the pipeline is fully developed, we hope it can serve as a starting point for digitizing other historical floristic manuals, giving biologists access to reservoirs of data that would otherwise be unusable.
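As a rough illustration of the entry-identification step, the sketch below pairs a regular expression with fuzzy matching to locate index names in lines of main text that may contain OCR errors. It is not the ExtractFlora implementation: the species names, the sample lines, the assumed "Genus species" entry format, the 0.85 similarity threshold, and the helper names `best_match` and `find_entries` are all hypothetical, and Python's standard-library `difflib` stands in for whatever fuzzy matcher the pipeline may adopt.

```python
import re
from difflib import SequenceMatcher

# Hypothetical index entries (binomial names); not taken from the manual.
INDEX_NAMES = ["Quercus alba", "Acer rubrum", "Pinus strobus"]

# Assumed entry format: a line that opens with "Genus species".
ENTRY_START = re.compile(r"^([A-Z][a-z]+ [a-z]+)\b")


def best_match(candidate, names, threshold=0.85):
    """Return the index name most similar to candidate, or None if none is close enough."""
    scored = [(SequenceMatcher(None, candidate, n).ratio(), n) for n in names]
    score, name = max(scored)
    return name if score >= threshold else None


def find_entries(lines, names):
    """Map each matched index name to the line number where its entry starts."""
    found = {}
    for i, line in enumerate(lines):
        m = ENTRY_START.match(line.strip())
        if not m:
            continue  # line does not look like the start of an entry
        name = best_match(m.group(1), names)
        if name is not None and name not in found:
            found[name] = i
    return found


# Hypothetical main-text lines, two of them with simulated OCR errors.
SAMPLE_LINES = [
    "Quercus alha L. White oak. Dry woods.",
    "Common in upland forests of the region.",
    "Acer rubrum L. Red maple. Swamps.",
    "Pinus strobos L. White pine. Sandy soils.",
]

entries = find_entries(SAMPLE_LINES, INDEX_NAMES)
```

In this sketch the regular expression only proposes candidate lines, and the similarity threshold decides whether a candidate is accepted, so a descriptive line such as the second sample line is rejected even though it matches the pattern.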

Bio: