ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

📅 2025-10-23
🤖 AI Summary
To address the low efficiency, poor verifiability, and difficulty of extracting structured data on ceramic piezoelectric materials from scientific literature, particularly chemical compositions and the piezoelectric strain coefficient *d*₃₃, this work proposes ComProScanner, a multi-agent collaborative framework. It integrates open- and closed-source large language models (LLMs), rule-based engines, and consistency verification mechanisms to enable end-to-end automated extraction, validation, classification, and visualization of composition–property data. The framework was evaluated on 100 journal articles across 10 LLMs; DeepSeek-V3-0324 performed best, with an overall accuracy of 0.82. The output is a reusable, machine-learning-ready structured dataset for piezoelectric ceramics, a class of materials for which no large dataset was previously available. Its core contribution is applying multi-agent collaboration to materials literature data engineering, combining semantic understanding with physical constraints to improve the accuracy and reproducibility of automated mining of complex experimental data.

📝 Abstract
Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
Problem

Research questions and friction points this paper is trying to address.

Extracts structured composition-property data from scientific literature
Automates validation and visualization of experimental material datasets
Addresses scarcity of tools for building machine learning-ready databases
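As a rough illustration of what "structured composition-property data" could look like, the sketch below parses a flat chemical formula into element amounts and runs two basic sanity checks on a record. The record layout, the field names, and the `parse_composition` and `validate` helpers are hypothetical and are not taken from ComProScanner; the positivity check on d33 is likewise an assumption made for this example, and the numbers are placeholders.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical record layout for illustration only; not the ComProScanner schema.
@dataclass
class CompositionProperty:
    composition: str     # flat formula, e.g. "Ba0.85Ca0.15Ti0.9Zr0.1O3"
    d33_pC_per_N: float  # piezoelectric strain coefficient d33
    doi: str             # identifier of the source article

# One element symbol followed by an optional integer or decimal amount.
_TOKEN = r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?"
_FORMULA = re.compile(rf"(?:{_TOKEN})+")

def parse_composition(formula: str) -> dict[str, float]:
    """Split a flat formula (no brackets or dopant notation) into element amounts."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    amounts: dict[str, float] = {}
    for element, amount in re.findall(_TOKEN, formula):
        # A missing amount means one atom; repeated elements are summed.
        amounts[element] = amounts.get(element, 0.0) + float(amount or 1)
    return amounts

def validate(record: CompositionProperty) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    try:
        parse_composition(record.composition)
    except ValueError as exc:
        problems.append(str(exc))
    # Assumed check for this sketch: d33 is reported as a finite, positive value.
    if not math.isfinite(record.d33_pC_per_N) or record.d33_pC_per_N <= 0:
        problems.append("d33 must be a finite, positive number")
    return problems

if __name__ == "__main__":
    # Placeholder values, not data from the paper.
    record = CompositionProperty("Ba0.85Ca0.15Ti0.9Zr0.1O3", 100.0, "10.0000/example")
    print(parse_composition(record.composition))
    print(validate(record))
```

Records that pass such checks can be collected into a table for downstream machine learning; formulas with brackets or variable placeholders would need a fuller parser than this one.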
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent platform for structured data extraction
Autonomous system integrating validation and visualization
Framework tested across multiple LLMs for accuracy
Aritra Roy
Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK; School of Engineering and Design, London South Bank University, London SE1 0AA, UK.
Enrico Grisan
London South Bank University
biomedical imaging
John Buckeridge
Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK; School of Engineering and Design, London South Bank University, London SE1 0AA, UK.
Chiara Gattinoni
Department of Physics, King's College London, London WC2R 2LS, UK.