🤖 AI Summary
This study evaluates GPT-4's ability to extract structured information on demand from the materials science literature, specifically assessing how faithfully it can reproduce, zero-shot, two manually curated materials datasets when given the source manuscripts. Methodologically, the authors use a domain-expert-driven error analysis with careful human annotation to diagnose where model outputs deviate from the curated data, covering issues such as numerical accuracy, contextual disambiguation, and unit standardization. The results reveal notable fidelity gaps in GPT-4's scientific information extraction (IE), particularly on precision-critical fields. The key contribution is a detailed, expert-led error analysis of scientific IE that characterizes the reliability limits of large language models in realistic research settings and draws on the experts' insights to suggest research directions toward high-fidelity AI-assisted scientific discovery.
📝 Abstract
We explore the ability of GPT-4 to perform ad hoc, schema-based information extraction from scientific literature. Specifically, we assess whether it can, with a basic prompting approach, replicate two existing materials science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis assessing where the model struggles to faithfully extract the desired information, and we draw on their insights to suggest research directions for addressing this broadly important task.
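To make the setup concrete, the sketch below shows what zero-shot, schema-based extraction with GPT-4 might look like; the schema fields, prompt wording, and model name are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of zero-shot, schema-based extraction (illustrative only;
# the schema fields, prompt, and model choice are assumptions, not the
# authors' exact setup).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical target schema for a single materials property record.
SCHEMA = {
    "material": "chemical formula or common name",
    "property": "name of the measured property",
    "value": "numeric value as reported",
    "unit": "unit of measurement",
}

def extract_records(manuscript_text: str) -> list[dict]:
    """Ask the model to emit JSON records matching SCHEMA from one paper."""
    prompt = (
        "Extract every materials property measurement from the text below. "
        "Return a JSON list of objects with exactly these keys:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        f"Text:\n{manuscript_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output to ease comparison with the curated dataset
    )
    # Assumes the model returns valid JSON; real pipelines need parsing/repair logic.
    return json.loads(response.choices[0].message.content)
```

Comparing the records such a call produces against the manually curated entries is exactly where the expert error analysis comes in: mismatches can then be attributed to numerical errors, missed context, unit inconsistencies, and similar failure modes.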