LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature

📅 2025-10-28
🤖 AI Summary
Critical materials synthesis knowledge is fragmented across vast volumes of unstructured scientific literature, severely impeding intelligent materials discovery.
Method: We propose the first multimodal automated extraction framework for materials synthesis, integrating large language models (LLMs) and vision-language models (VLMs) with ontology-driven data modeling, LLM-as-a-judge quality assessment, and expert curation to jointly extract synthesis procedures, reaction conditions, and performance metrics from both text and figures with high accuracy.
Contribution/Results: We construct LeMat-Synth (v1.0), a large-scale, structured dataset covering 35 synthesis methods and 16 material classes, comprising 25,000 high-quality synthesis protocols derived from 81,000 open-access papers. We also release a modular, open-source tool library to enable community-driven extension. This work establishes a scalable foundational infrastructure for modeling synthesis–structure–property relationships and enabling predictive materials design.

📝 Abstract
The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis–structure–property relationships.
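
To make "structured according to an ontology" more concrete, here is a minimal Python sketch of what one machine-readable synthesis record could look like. It is not the LeMat-Synth schema: the class names (`SynthesisStep`, `SynthesisRecord`), every field name, and the example values are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SynthesisStep:
    """One action in a procedure (hypothetical fields, not the paper's ontology)."""
    action: str                            # e.g. "mix", "calcine"
    temperature_c: Optional[float] = None  # None when the source gives no value
    duration_h: Optional[float] = None


@dataclass
class SynthesisRecord:
    """One extracted procedure (hypothetical fields, not the paper's ontology)."""
    material: str
    material_class: str
    synthesis_method: str
    steps: List[SynthesisStep] = field(default_factory=list)
    source_doi: Optional[str] = None

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so steps become plain dicts.
        return json.dumps(asdict(self), sort_keys=True)


# Made-up values for illustration only; not taken from the dataset.
example = SynthesisRecord(
    material="TiO2",
    material_class="oxide",
    synthesis_method="sol-gel",
    steps=[SynthesisStep(action="calcine", temperature_c=450.0, duration_h=2.0)],
)
print(example.to_json())
```

Keeping unreported values as `None` rather than guessing a default is one way to keep the record faithful to what the source text or figure actually states.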
Problem

Research questions and friction points this paper is trying to address.

Extracting synthesis procedures from unstructured scientific literature using AI
Organizing materials synthesis data across text and figures systematically
Creating a machine-readable database to model synthesis-structure-property relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs and VLMs for data extraction
Creates structured synthesis database from literature
Provides open-source modular software for extension
Authors

Magdalena Lederbauer
Entalpic

Siddharth Betala
Entalpic

Xiyao Li
Entalpic

Ayush Jain
Georgia Institute of Technology

Amine Sehaba
ENSA Lyon UMR-MAP Aria

Georgia Channing
Hugging Face

Grégoire Germain
Entalpic

Anamaria Leonescu
Entalpic

Faris Flaifil
Independent Researcher

Alfonso Amayuelas
University of California, Santa Barbara
Artificial Intelligence, Natural Language Processing, Machine Learning, Large Language Models

Alexandre Nozadze
Swiss Federal Institute of Technology Zurich, Paul Scherrer Institute

Stefan P. Schmid
Swiss Federal Institute of Technology Zurich

Mohd Zaki
Postdoctoral Researcher, Hopkins Extreme Materials Institute, Johns Hopkins University
Civil Engineering, Material Science, Machine Learning

Sudheesh Kumar Ethirajan
University of California, Davis

Elton Pan
PhD Candidate, MIT
generative models, reinforcement learning, materials informatics, materials synthesis

Mathilde Franckel
Entalpic

Alexandre Duval
Entalpic

N. M. Anoop Krishnan
Associate Professor, Indian Institute of Technology Delhi
AI for Materials, AI4Science, Machine Learning, Atomistic Modeling, Glass

Samuel P. Gleason
Entalpic