LLM-Driven Multi-Agent Curation and Expansion of Metal-Organic Frameworks Database

📅 2025-12-01
🤖 AI Summary
Nearly half of the entries in major MOF databases contain substantial structural errors, which undermines high-throughput screening and machine-learning applications. To address this, we propose LitMOF, a multi-agent framework powered by large language models (LLMs) that validates crystallographic information directly from the original literature and cross-validates it against existing database entries, enabling both error correction and data augmentation. LitMOF establishes a “literature-driven, self-correcting” paradigm for scientific database curation, integrating LLM-based reasoning, collaborative multi-agent orchestration, natural language processing, and domain-specific crystallographic parsing. Applied to the CSD MOF Subset, LitMOF produced LitMOF-DB, a curated database of 118,464 computation-ready MOF structures that includes corrections for 69% (6,161) of the invalid MOFs in the latest CoRE MOF database. It also identified 12,646 experimentally reported MOFs absent from existing resources. Together, these results improve the accuracy, completeness, and computational readiness of MOF data.

📝 Abstract
Metal-organic framework (MOF) databases have grown rapidly through experimental deposition and large-scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These inaccuracies propagate through high-throughput screening and machine-learning workflows, limiting the reliability of data-driven MOF discovery. Correcting such errors is exceptionally difficult because true repairs require integrating crystallographic files, synthesis descriptions, and contextual evidence scattered across the literature. Here we introduce LitMOF, a large language model-driven multi-agent framework that validates crystallographic information directly from the original literature and cross-validates it with database entries to repair structural errors. Applying LitMOF to the experimental MOF database (the CSD MOF Subset), we constructed LitMOF-DB, a curated set of 118,464 computation-ready structures, including corrections of 69% (6,161 MOFs) of the invalid MOFs in the latest CoRE MOF database. Additionally, the system uncovered 12,646 experimentally reported MOFs absent from existing resources, substantially expanding the known experimental design space. This work establishes a scalable pathway toward self-correcting scientific databases and a generalizable paradigm for LLM-driven curation in materials science.

Problem

Research questions and friction points this paper is trying to address.

Corrects structural errors in MOF databases using LLM-driven validation.
Expands the known experimental MOF design space by recovering experimentally reported structures that are absent from existing databases.
Provides a scalable, self-correcting framework for materials science databases.

Innovation

Methods, ideas, or system contributions that make the work stand out.

An LLM-driven multi-agent framework validates crystallographic data directly from the original literature.
Cross-validates literature-derived information with database entries to repair structural errors.
Constructs a curated database (LitMOF-DB) and expands the known experimental design space.
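
The cross-validation idea in the list above can be illustrated with a minimal Python sketch. This is not the LitMOF implementation, which the abstract does not detail: the `StructureRecord` type, its fields, the `cross_validate` function, the 1% tolerance, and all numeric values are hypothetical and chosen only to show how a literature-derived record might be compared against a database entry to flag discrepancies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StructureRecord:
    """Hypothetical minimal record; the field names are illustrative only."""
    metal: str
    cell_lengths: Tuple[float, float, float]  # a, b, c in angstroms


def cross_validate(literature: StructureRecord,
                   database: StructureRecord,
                   tol: float = 0.01) -> List[str]:
    """Return human-readable mismatches between the two records.

    `tol` is an assumed relative tolerance for cell lengths, not a value
    taken from the paper.
    """
    issues = []
    if literature.metal != database.metal:
        issues.append(
            f"metal: literature={literature.metal}, database={database.metal}"
        )
    for name, lit, db in zip("abc", literature.cell_lengths, database.cell_lengths):
        # Flag a cell length when the relative difference exceeds the tolerance.
        if abs(lit - db) > tol * max(abs(lit), abs(db)):
            issues.append(f"cell length {name}: literature={lit}, database={db}")
    return issues


# Made-up values for illustration; they do not describe a real MOF.
lit = StructureRecord("Zn", (10.0, 10.0, 10.0))
db = StructureRecord("Zn", (10.0, 10.0, 12.0))
print(cross_validate(lit, db))
```

In this sketch an empty list means the two records agree within tolerance, while a non-empty list marks the entry as a candidate for repair. A real pipeline would compare far richer evidence, such as full crystallographic files and synthesis descriptions.
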
Honghui Kim
Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Dohoon Kim
Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Jihan Kim
KAIST