ARETE: an R package for Automated REtrieval from TExt with large language models

📅 2025-11-06
🤖 AI Summary
Species occurrence records, a key input for conservation work, are severely incomplete. Much of the missing information is embedded in scientific papers and gray literature, where it is not machine-readable and is labor-intensive to extract by hand. The ARETE R package addresses this by using large language models (LLMs), accessed through the ChatGPT API, in an end-to-end extraction pipeline that runs from OCR through outlier detection to tabular output, and that is validated against the work of human annotators. In an evaluation on 100 spider species, the newly extracted records expanded the known Extent of Occurrence by a mean of three orders of magnitude compared with GBIF data, revealing areas where the species were found in the past. The approach gives faster access to previously untapped occurrence data for spatial conservation planning and extinction risk assessment.

📝 Abstract
1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers must contend with the accelerating pace at which new information has to be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information, but their data are often not machine-readable and require extensive human work to retrieve.
2. We present the ARETE R package, open-source software that aims to automate the extraction of species occurrence data using large language models, namely through the ChatGPT Application Programming Interface. The package integrates all steps of the data extraction and validation process, from Optical Character Recognition to outlier detection and output in tabular format. Furthermore, we validate ARETE through systematic comparison between the model's output and the work of human annotators.
3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those produced from automatically extracted data for 100 species of spiders. The newly extracted data expanded the known Extent of Occurrence by a mean of three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments.
4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also makes it possible to predict how much bibliographic data is available during project planning.
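To make the "orders of magnitude" comparison in point 3 concrete, here is a minimal sketch. It assumes Extent of Occurrence is measured as the area of the minimum convex polygon around the occurrence points, which is a common convention but is not confirmed by the abstract as ARETE's method. The coordinates are hypothetical, already projected to kilometres, and the function names are illustrative rather than part of the package.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def eoo_area(points):
    """Assumed EOO proxy: area of the convex hull of the points (planar)."""
    return polygon_area(convex_hull(points))


def orders_of_magnitude(area_before, area_after):
    """log10 of the expansion factor between two areas."""
    if area_before <= 0 or area_after <= 0:
        raise ValueError("both areas must be positive")
    return math.log10(area_after / area_before)


# Hypothetical occurrence points in projected km (not real data).
gbif_points = [(0, 0), (10, 0), (10, 10), (0, 10)]
extracted_points = [(400, 0), (400, 250), (0, 250)]

area_gbif = eoo_area(gbif_points)
area_combined = eoo_area(gbif_points + extracted_points)
expansion = orders_of_magnitude(area_gbif, area_combined)
print(area_gbif, area_combined, round(expansion, 2))
```

A species with a single known record or with collinear records has a hull area of zero, so the ratio is undefined; the sketch raises an error in that case rather than guessing how such species should be handled.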
Problem

Research questions and friction points this paper is trying to address.

Lack of machine-readable species occurrence data in scientific publications
Manual data extraction is time-consuming for conservation research
Need to accelerate biodiversity data processing for conservation planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates species data extraction using large language models
Integrates OCR, validation, and outlier detection in R package
Systematically compares model outputs with human annotations
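One way to picture the comparison with human annotators is as set overlap between extracted and annotated records. The sketch below assumes each record is a (species, locality) pair and scores agreement with precision, recall, and F1. These metrics, the record format, and the example records are assumptions for illustration; the source does not state which measures ARETE reports.

```python
def compare_to_annotations(model_records, human_records):
    """Score model-extracted records against human-annotated ones.

    Records are treated as exact-match tuples; duplicates are ignored.
    """
    model = set(model_records)
    human = set(human_records)
    true_positives = len(model & human)

    precision = true_positives / len(model) if model else 0.0
    recall = true_positives / len(human) if human else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical records for a single document (not real data).
human = [
    ("Species A", "Locality 1"),
    ("Species A", "Locality 2"),
    ("Species B", "Locality 1"),
    ("Species B", "Locality 3"),
    ("Species C", "Locality 4"),
]
model = [
    ("Species A", "Locality 1"),
    ("Species A", "Locality 2"),
    ("Species B", "Locality 1"),
    ("Species B", "Locality 9"),  # not present in the human annotation
]

scores = compare_to_annotations(model, human)
print(scores)
```

Exact matching is a deliberate simplification: in practice locality names and coordinates vary in spelling and precision, so a real comparison would need some tolerance when deciding that two records agree.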