🤖 AI Summary
To improve the accuracy of structured data extraction from materials science literature, this paper proposes a hybrid text-mining framework that combines the strengths of the multi-step and direct (end-to-end) paradigms. Raw text is first converted into entity-recognized text and then into structured form. Named entity recognition (NER) is strengthened with an entity marker, a simple technique that uses symbolic annotations to highlight target entities. The entity marker-based hybrid approach consistently outperforms previous entity recognition approaches on three benchmark datasets (MatScholar, SOFC, and SOFC slot NER). It also improves the quality of the final structured data, with up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared with the direct approach.
📝 Abstract
The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. While the two existing approaches, multi-step and direct methods, offer valuable capabilities, each has limitations when applied independently. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. Beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not only consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improves the quality of the final structured data, yielding up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared to the direct approach.
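The two-stage idea described above (raw text to entity-recognized text, then to structured form) can be illustrated with a minimal Python sketch. The abstract does not specify the marker symbols, so the bracketed `[LABEL]...[/LABEL]` format, the labels `MAT` and `VAL`, and the function names `mark_entities` and `extract_entities` are all assumptions made for illustration, not the authors' implementation. In the actual framework a model would presumably predict the entity spans; here they are supplied by hand.

```python
import re
from typing import List, Tuple

# (start, end, label) with end exclusive; a hypothetical span format.
Span = Tuple[int, int, str]


def mark_entities(text: str, spans: List[Span]) -> str:
    """Wrap each entity span in symbolic markers (assumed bracket format)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])                      # text before the entity
        pieces.append(f"[{label}]{text[start:end]}[/{label}]")  # marked entity
        cursor = end
    pieces.append(text[cursor:])                               # trailing text
    return "".join(pieces)


def extract_entities(marked: str) -> List[Tuple[str, str]]:
    """Recover (label, surface text) pairs from entity-recognized text."""
    return re.findall(r"\[(\w+)\](.*?)\[/\1\]", marked)


sentence = "LiFePO4 was sintered at 700 C."
spans = [(0, 7, "MAT"), (24, 29, "VAL")]  # hand-written spans for the example

marked = mark_entities(sentence, spans)
print(marked)
print(extract_entities(marked))
```

The marked string serves as the intermediate "entity-recognized text", and the extraction step stands in for the later conversion to structured form. Relation extraction, which the framework also covers, is left out of this sketch.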