MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables

📅 2025-09-12
🤖 AI Summary
Problem: Extracting and using semi-structured tabular knowledge from materials science literature remains difficult because systematic, domain-aware methods are lacking. Method: The paper proposes a graph neural network (GNN) approach guided by scientific priors. Tables are converted into heterogeneous graphs, and physical constraints (such as conservation laws and dimensional consistency) are built directly into the model architecture. Contribution/Results: Applied end to end to nearly 69,000 scientific tables, the method yields a knowledge base of over 535,000 records, including 104,000 material compositions that extend coverage beyond major existing databases, pending manual validation. It reaches F1 scores of 88.68 for property extraction and 71.35 for composition extraction, and runs 19–496× faster than state-of-the-art large language models. By combining domain physics with graph representation learning, the approach supports data-driven study of composition-property relationships in materials science.

📝 Abstract
Scientific progress increasingly depends on synthesizing knowledge across vast literature, yet most experimental data remains trapped in semi-structured formats that resist systematic extraction and analysis. Here, we present MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at unprecedented scale. Our approach transforms tables into graph-based representations processed by constraint-driven GNNs that encode scientific principles directly into model architecture. MatSKRAFT significantly outperforms state-of-the-art large language models, achieving F1 scores of 88.68 for property extraction and 71.35 for composition extraction, while processing data $19$-$496\times$ faster than them (compared to the slowest and the fastest models, respectively) with modest hardware requirements. Applied to nearly 69,000 tables from more than 47,000 research publications, we construct a comprehensive database containing over 535,000 entries, including 104,000 compositions that expand coverage beyond major existing databases, pending manual validation. This systematic approach reveals previously overlooked materials with distinct property combinations and enables data-driven discovery of composition-property relationships forming the cornerstone of materials and scientific discovery.
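
For readers unfamiliar with the metric, the F1 scores quoted above are presumably the usual harmonic mean of precision and recall. The abstract does not say how the scores are averaged across properties or tables, so the following is only a minimal sketch of the standard definition. The function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if true_pos == 0:
        # No correct extractions: precision and recall are both zero
        # (or undefined), so report 0.0 by convention.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 correct extractions, 2 spurious, 2 missed.
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
print(f"F1 = {100 * example_f1:.2f}")
```

The abstract reports F1 on a 0-100 scale, which is why the usage line multiplies by 100.
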
Problem

Research questions and friction points this paper is trying to address.

Extracting materials science knowledge from semi-structured tables
Automating large-scale data integration from scientific literature
Enabling systematic discovery of composition-property relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based table representations using constraint-driven GNNs
Automated extraction from scientific tabular data
Direct encoding of scientific principles into architecture
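
The first point above, representing a table as a graph, can be illustrated with a small sketch. This page does not describe the paper's actual graph schema, so everything here is an assumption: the `TableGraph` container, the `header`/`cell` node types, and the `same_row`/`same_col` edge types are hypothetical names chosen for illustration, and the example table values are made up. The sketch also assumes a rectangular table whose first row is the header.

```python
from dataclasses import dataclass, field


@dataclass
class TableGraph:
    """Hypothetical container: typed nodes and typed edges."""
    nodes: list = field(default_factory=list)  # (node_type, cell_text)
    edges: list = field(default_factory=list)  # (src_index, dst_index, edge_type)


def table_to_graph(rows: list) -> TableGraph:
    """Turn a rectangular table (list of rows of strings) into a graph.

    Each cell becomes a node. Cells in the first row are typed "header",
    all others "cell". Horizontally adjacent cells are linked by
    "same_row" edges and vertically adjacent cells by "same_col" edges.
    """
    graph = TableGraph()
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    index = {}  # (row, col) -> node index

    for r, row in enumerate(rows):
        for c, text in enumerate(row):
            node_type = "header" if r == 0 else "cell"
            index[(r, c)] = len(graph.nodes)
            graph.nodes.append((node_type, text))

    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:
                graph.edges.append((index[(r, c)], index[(r, c + 1)], "same_row"))
            if r + 1 < n_rows:
                graph.edges.append((index[(r, c)], index[(r + 1, c)], "same_col"))
    return graph


# Made-up composition/property table for illustration only.
example_table = [
    ["Sample", "SiO2 (mol%)", "Tg (K)"],
    ["G1", "70", "780"],
    ["G2", "65", "765"],
]
example_graph = table_to_graph(example_table)
print(len(example_graph.nodes), "nodes,", len(example_graph.edges), "edges")
```

Typing nodes and edges is what would let a downstream GNN treat headers, values, and row or column relations differently. How MatSKRAFT actually encodes scientific constraints in its architecture is not specified in this summary and is not attempted here.
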