Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

📅 2025-10-01
🤖 AI Summary
Existing methods for extracting information from the materials literature cover only a narrow set of features and struggle to capture the linked relationships among composition, processing, microstructure, and properties (C–P–M–P). To address this, we propose a multi-stage, source-tracked extraction framework built on large language models. It performs iterative entity recognition, disambiguation, relation inference, and provenance verification to extract experimental data spanning 47 fine-grained features end to end. The approach jointly extracts the full C–P–M–P chain and keeps every extracted value traceable to its source. It attains F1 scores of about 0.96 at both the feature level and the tuple level. Compared with single-pass extraction without source tracking, microstructure F1 improves by 10.0% (feature level) and 13.7% (tuple level), the number of missed materials drops from 49 to 13 out of 396, and no false positives are observed. The framework improves the completeness, accuracy, and interpretability of materials databases.

📝 Abstract
Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, thereby posing challenges for building comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores of around 0.96. Compared with single-pass extraction without source tracking, our approach improved the F1 scores for the microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
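
The miss rates quoted in the abstract follow directly from the reported counts. The short check below reproduces them; the function and variable names are illustrative and do not come from the paper.

```python
# Counts reported in the abstract (396 materials across 100 articles).
TOTAL_MATERIALS = 396
MISSED_SINGLE_PASS = 49   # single-pass extraction without source tracking
MISSED_MULTI_STAGE = 13   # multi-stage pipeline with source tracking


def miss_rate(missed: int, total: int) -> float:
    """Fraction of ground-truth materials that the extractor failed to return."""
    if total <= 0:
        raise ValueError("total must be positive")
    return missed / total


baseline = miss_rate(MISSED_SINGLE_PASS, TOTAL_MATERIALS)
improved = miss_rate(MISSED_MULTI_STAGE, TOTAL_MATERIALS)

print(f"single-pass miss rate: {baseline:.1%}")   # 12.4%
print(f"multi-stage miss rate: {improved:.1%}")   # 3.3%
print(f"materials recovered:   {MISSED_SINGLE_PASS - MISSED_MULTI_STAGE}")  # 36
```

In other words, the multi-stage pipeline recovers 36 of the 49 materials that the single-pass baseline missed.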
Problem

Research questions and friction points this paper is trying to address.

Extracting comprehensive materials data trapped in unstructured scientific literature
Capturing integrated composition-processing-microstructure-property relationships from experiments
Building reliable databases with minimal omissions and zero false positives
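
To make the "minimal omissions and zero false positives" goal concrete, the sketch below scores a toy extraction at the feature level and at the tuple level. It assumes exact-match scoring over sets, which may differ from the paper's evaluation protocol. The records are made-up placeholder values, not data from the paper, and `precision_recall_f1` is a hypothetical helper name.

```python
from __future__ import annotations


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of records."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Feature level: each (material, feature, value) triple is scored on its own.
# Toy placeholder values, not taken from the paper.
gold_features = {
    ("alloy-A", "aging_temperature_C", "700"),
    ("alloy-A", "aging_time_h", "24"),
    ("alloy-A", "precipitate_phase", "L12"),
    ("alloy-A", "yield_strength_MPa", "650"),
}
pred_features = gold_features - {("alloy-A", "precipitate_phase", "L12")}

# Tuple level: interdependent values must all match as one unit,
# here (material, test temperature in C, yield strength in MPa).
gold_tuples = {("alloy-A", "25", "650"), ("alloy-A", "600", "480")}
pred_tuples = {("alloy-A", "25", "650")}

print("feature level:", precision_recall_f1(pred_features, gold_features))
print("tuple level:  ", precision_recall_f1(pred_tuples, gold_tuples))
```

With no false positives, precision is 1.0 at both levels, so F1 is governed entirely by recall, that is, by how many items are omitted.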
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage LLM pipeline extracts 47 material features
Integrates source tracking for enhanced accuracy
Modular design generalizes across diverse material classes
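
One way to picture source tracking is as a verification stage that keeps an extracted value only if the quote offered as evidence really appears in the article and contains that value. The sketch below illustrates that idea under stated assumptions. It is not the authors' implementation: the LLM extraction stage is replaced by a hand-written candidate list, and `Extraction`, `is_traceable`, and `filter_traceable` are hypothetical names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    material: str
    feature: str
    value: str
    source_quote: str  # text the extractor claims supports the value


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_traceable(item: Extraction, article_text: str) -> bool:
    """True if the quote occurs in the article and the quote contains the value."""
    quote = normalize(item.source_quote)
    return (
        bool(quote)
        and quote in normalize(article_text)
        and normalize(item.value) in quote
    )


def filter_traceable(
    items: list[Extraction], article_text: str
) -> tuple[list[Extraction], list[Extraction]]:
    """Split candidates into (kept, rejected) according to the source check."""
    kept, rejected = [], []
    for item in items:
        (kept if is_traceable(item, article_text) else rejected).append(item)
    return kept, rejected


# Toy article text and candidates (placeholder values, not from the paper).
article = "The alloy-A sample was aged at 700 C for\n24 h. Its yield strength was 650 MPa."
candidates = [
    Extraction("alloy-A", "aging_temperature_C", "700", "aged at 700 C for 24 h"),
    Extraction("alloy-A", "precipitate_phase", "L12", "L12 precipitates were observed"),
]

kept, rejected = filter_traceable(candidates, article)
print("kept:    ", [c.feature for c in kept])
print("rejected:", [c.feature for c in rejected])
```

Here the second candidate is rejected because its supporting quote does not occur in the article. A real pipeline would likely feed rejected items back into another extraction pass rather than simply dropping them.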
Xin Wang
School of Library and Information Studies, University of Alabama, Tuscaloosa, 35487, AL, USA.
Anshu Raj
School of Aerospace and Mechanical Engineering, University of Oklahoma, Norman, 73019, OK, USA.
Matthew Luebbe
Department of Materials Science and Engineering, Missouri University of Science and Technology, Rolla, 65409, MO, USA.
Haiming Wen
Department of Materials Science and Engineering, Missouri University of Science and Technology, Rolla, 65409, MO, USA.
Shuozhi Xu
School of Aerospace and Mechanical Engineering, University of Oklahoma, Norman, 73019, OK, USA.
Kun Lu
University of Alabama
Applied natural language processing, Large language models, Text mining