🤖 AI Summary
Marine lead (Pb) data remain largely embedded in unstructured academic literature, creating data silos that hinder large-scale synthesis. Manual extraction is labor-intensive and does not scale, while general-purpose large language models (LLMs) often produce errors because they lack domain knowledge. To address this, the study proposes an expert-guided LLM adaptation framework built around a domain-specific knowledge tree co-designed with marine scientists, which decomposes complex extraction tasks into verifiable steps and integrates multi-level validation. Without any model fine-tuning, the approach extracts 3,751 new Pb records from a corpus of over 230,000 open-access papers, with 92% accuracy confirmed by expert manual verification. The result is the largest integrated marine Pb database to date, with improved coverage of under-sampled regions such as the East China Sea and the Southern Ocean. An interactive visualization platform accompanies the dataset for community access.
📝 Abstract
Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.
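The idea of decomposing extraction into verifiable steps can be illustrated with a minimal sketch. This is not the Compass implementation, which the abstract does not detail. The `Node` class, the `validate` function, the specific checks, the allowed-unit list, and the sample record below are all hypothetical, chosen only to show how a tree of expert-defined checks might report the first step at which a candidate record fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]  # a candidate extracted record (hypothetical schema)


@dataclass
class Node:
    """One verifiable step in a hypothetical knowledge tree."""
    name: str
    check: Callable[[Record], bool]
    children: List["Node"] = field(default_factory=list)


def validate(node: Node, record: Record, path: Tuple[str, ...] = ()) -> Optional[str]:
    """Walk the tree depth-first; return the path of the first failing step, or None."""
    here = path + (node.name,)
    if not node.check(record):
        return " > ".join(here)
    for child in node.children:
        failure = validate(child, record, here)
        if failure is not None:
            return failure
    return None


# Assumed unit vocabulary, for illustration only.
ALLOWED_UNITS = {"pmol/kg", "pmol/L", "nmol/kg"}


def valid_value(r: Record) -> bool:
    v = r.get("value")
    return isinstance(v, (int, float)) and v >= 0


def valid_location(r: Record) -> bool:
    lat, lon = r.get("lat"), r.get("lon")
    return (
        isinstance(lat, (int, float))
        and isinstance(lon, (int, float))
        and -90 <= lat <= 90
        and -180 <= lon <= 180
    )


# Root step confirms the record is about Pb; child steps check individual fields.
tree = Node(
    "pb_record",
    lambda r: r.get("element") == "Pb",
    [
        Node("unit", lambda r: r.get("unit") in ALLOWED_UNITS),
        Node("value", valid_value),
        Node("location", valid_location),
    ],
)

# Made-up sample record, not taken from the dataset.
sample = {"element": "Pb", "unit": "pmol/kg", "value": 35.0, "lat": 30.0, "lon": 125.0}
print(validate(tree, sample))                   # passes every step
print(validate(tree, dict(sample, lat=120.0)))  # fails at the location step
```

In a design like this, a record that fails any step could be routed to expert review with the failing path attached, instead of being silently accepted or dropped.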