🤖 AI Summary
This study addresses the problem that materials data in the scientific literature are fragmented and heterogeneous, which makes manual extraction slow and error-prone. The authors propose a hierarchical, priority-driven workflow that uses a large language model (LLM) to automatically reconstruct structured records of shock-physics experiments by integrating text, tables, figures, and physical laws through a three-tier strategy: direct extraction (T1), physics-based derivation (T2), and figure digitization (T3). The method relies on prompting, physics-based consistency checks, and unit normalization; it runs end-to-end without fine-tuning and can be deployed through an API. Evaluated on 30 papers comprising 11,967 data points, the approach achieves an overall weighted accuracy of 94.69%, with tier-specific accuracies of 94.93% (T1), 92.04% (T2), and 83.49% (T3), indicating that the extracted data are accurate and traceable and that the workflow scales.
📝 Abstract
Scientific data are widely dispersed across research articles and are often reported inconsistently across text, tables, and figures, making manual data extraction and aggregation slow and error-prone. We present a prompt-driven, hierarchical workflow that uses a large language model (LLM) to automatically extract and reconstruct structured, shot-level shock-physics experimental records from full-text published research articles by integrating information distributed across text, tables, figures, and physics-based derivations, using alloy spall strength as a representative case study. The pipeline targeted 37 experimentally relevant fields per shot and applied a three-level priority strategy: (T1) direct extraction from text/tables, (T2) physics-based derivation using verified governing relations, and (T3) digitization from figures when necessary. Extracted values were normalized to canonical units, tagged by priority for traceability, and validated with physics-based consistency and plausibility checks. Evaluated on a benchmark of 30 published research articles comprising 11,967 data points, the workflow achieved priority-wise accuracies of 94.93% (T1), 92.04% (T2), and 83.49% (T3), and an overall weighted accuracy of 94.69%. Cross-model testing further indicated strong agreement for text/table and equation-derived fields, with lower agreement for figure-based extraction. Implementation through an API demonstrated the scalability of the approach, achieving consistent extraction performance and, in a subset of test cases, matching or exceeding chat-based accuracy. This workflow demonstrates a practical approach for converting unstructured technical literature into traceable, analysis-ready datasets without task-specific fine-tuning, enabling scalable database construction in materials science.
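The three-level priority strategy, unit normalization, and plausibility checking described above might look roughly like the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `resolve`, `derive_spall_strength`, and `is_plausible`, the conversion table, and the plausibility bounds are all hypothetical. The abstract does not say which governing relations were used for T2, so the sketch substitutes the standard acoustic approximation for spall strength, σ_sp ≈ ½ ρ₀ c_b Δu_fs, as a stand-in.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Illustrative conversion factors to a canonical stress unit (GPa).
# Assumption: the abstract does not list the canonical units actually used.
TO_GPA = {"GPa": 1.0, "MPa": 1e-3, "kbar": 0.1}

# Earlier position means higher priority: direct > derived > digitized.
TIER_ORDER = ("T1", "T2", "T3")


@dataclass(frozen=True)
class Candidate:
    """One candidate value for a field, tagged with its source tier."""
    value: float
    unit: str
    tier: str  # "T1" (text/table), "T2" (derived), or "T3" (figure)


def normalize(c: Candidate) -> Candidate:
    """Convert a stress-like candidate to GPa, keeping its tier tag."""
    return Candidate(c.value * TO_GPA[c.unit], "GPa", c.tier)


def resolve(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the highest-priority candidate in canonical units, or None."""
    ranked = sorted(candidates, key=lambda c: TIER_ORDER.index(c.tier))
    return normalize(ranked[0]) if ranked else None


def derive_spall_strength(rho0: float, c_bulk: float, pullback: float) -> Candidate:
    """T2 stand-in: acoustic approximation 0.5 * rho0 * c_b * delta_u_fs.

    rho0 in kg/m^3, c_bulk and pullback velocity in m/s; result in GPa.
    """
    sigma_pa = 0.5 * rho0 * c_bulk * pullback
    return Candidate(sigma_pa / 1e9, "GPa", "T2")


def is_plausible(c: Candidate, low: float = 0.0, high: float = 20.0) -> bool:
    """Hypothetical range check (bounds in GPa are illustrative only)."""
    return low < normalize(c).value < high


# Made-up example values for a single shot.
direct = Candidate(1500.0, "MPa", "T1")
derived = derive_spall_strength(2700.0, 5300.0, 200.0)
digitized = Candidate(1.2, "GPa", "T3")

best = resolve([digitized, derived, direct])
print(best, is_plausible(best))
```

Keeping the tier tag on the resolved value is what makes each record traceable: a downstream user can filter out figure-derived (T3) values, which showed the lowest accuracy in the evaluation.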