🤖 AI Summary
Species distribution data, particularly occurrence records, remain severely incomplete: much of the information is embedded in unstructured scientific and grey literature, where it is not machine-readable and is labor-intensive to extract manually. To address this, we present ARETE, an open-source R package that uses large language models (LLMs), accessed through the ChatGPT API, to run an end-to-end extraction pipeline covering OCR, text extraction, outlier detection, and tabular output, validated by systematic comparison against human annotators. In an evaluation on 100 spider species, newly extracted records expanded the known Extent of Occurrence by a mean of three orders of magnitude relative to GBIF data, revealing areas where the species were found in the past. The approach eases a longstanding bottleneck in recovering biodiversity data from text and offers a scalable route to occurrence data for extinction risk assessment and spatial conservation planning.
📝 Abstract
1. A major obstacle to implementing rigorous conservation initiatives is the lack of key species data, especially occurrence data. Furthermore, anthropogenic activity is accelerating the pace at which new information must be collected and processed. Publications ranging from scientific papers to gray literature contain this crucial information, but their data are often not machine-readable and require extensive human work to retrieve.
2. We present the ARETE R package, open-source software that automates the extraction of species occurrence data using large language models, namely through the ChatGPT Application Programming Interface. The package integrates all steps of the data extraction and validation process, from Optical Character Recognition to outlier detection and output in tabular format. Furthermore, we validate ARETE through systematic comparison between model output and the work of human annotators.
3. We demonstrate the usefulness of the approach by comparing range maps produced from GBIF data with those produced from automatically extracted data for 100 species of spiders. Newly extracted data expanded the known Extent of Occurrence by a mean of three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments.
4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer for projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also makes it possible to estimate, during project planning, how much bibliographic data is available.
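To make the "orders of magnitude" comparison concrete, here is a minimal sketch of how an Extent of Occurrence (EOO) gain could be quantified. It is not ARETE's code and does not use its API. It assumes that EOO is measured as the area of the minimum convex polygon around the occurrence points (the usual IUCN convention; the abstract does not state the method used) and that coordinates are already projected to a planar system in kilometres. The function names and the toy coordinates are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def eoo_area(points):
    """Area of the minimum convex polygon (shoelace formula), in squared input units."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0  # fewer than three distinct points enclose no area
    twice_area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def orders_of_magnitude(area_before, area_after):
    """log10 of the ratio between two positive areas."""
    return math.log10(area_after / area_before)


# Toy data (hypothetical, planar km): a small baseline range and
# additional literature-derived records that extend it.
baseline = [(0, 0), (10, 0), (10, 10), (0, 10)]
extracted = [(100, 0), (100, 100), (0, 100)]

eoo_baseline = eoo_area(baseline)
eoo_combined = eoo_area(baseline + extracted)
print(eoo_baseline, eoo_combined, orders_of_magnitude(eoo_baseline, eoo_combined))
```

A real analysis would project geographic coordinates with an equal-area projection before computing areas, and would need a rule for species whose baseline has fewer than three distinct records, where the baseline EOO is zero and the ratio is undefined.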