A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of large-scale, high-quality datasets that align molecular structures with natural language descriptions, a key limitation for large language models (LLMs) on chemical tasks. The authors propose a fully automated annotation framework that builds molecule–language alignment data at scale without human annotation. By extending a rule-based IUPAC name parser to generate structured XML metadata that explicitly encodes molecular structure, they guide LLMs toward descriptions that are both chemically accurate and linguistically fluent. The resulting dataset comprises approximately 163,000 molecule–description pairs, and a combined LLM and human-expert evaluation on a 2,000-molecule subset reports a description precision of 98.6%. The dataset is intended as a reliable foundation for research on molecule–language alignment.
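As a rough reading of the validation figure, the short sketch below converts the reported 98.6% on 2,000 molecules into an implied count and a 95% Wilson score interval. It assumes the precision is a simple per-description fraction of the 2,000 samples, which the summary does not state explicitly. The function name `wilson_interval` is illustrative only.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

n_samples = 2000          # size of the validated subset (from the summary)
precision = 0.986         # reported description precision (from the summary)

# Assumption: precision = correct descriptions / validated descriptions.
implied_correct = round(precision * n_samples)
low, high = wilson_interval(precision, n_samples)

print(f"implied correct descriptions: {implied_correct} of {n_samples}")
print(f"95% Wilson interval: [{low:.4f}, {high:.4f}]")
```

The interval only reflects sampling uncertainty on the 2,000-molecule subset. It says nothing about how representative that subset is of the full dataset.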

📝 Abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
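The pipeline described in the abstract (parse an IUPAC name, encode the structure as XML metadata, then use that metadata to guide an LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the XML element names (`molecule`, `substituent`, `parent`, `suffix`), the `locant` attribute, the hand-written `components` list standing in for real parser output, and the prompt wording are all assumptions made for illustration. No actual nomenclature parser or LLM is called.

```python
import xml.etree.ElementTree as ET

def build_metadata(iupac_name, components):
    """Wrap (hypothetical) parser output in a small XML tree.

    `components` is a list of (role, text, locant) tuples; in a real
    pipeline these would come from a rule-based IUPAC name parser.
    """
    root = ET.Element("molecule", {"name": iupac_name})
    for role, text, locant in components:
        node = ET.SubElement(root, role)
        node.text = text
        if locant is not None:
            node.set("locant", locant)
    return root

def build_prompt(metadata):
    """Embed the XML metadata in an instruction for an LLM (wording is illustrative)."""
    xml_text = ET.tostring(metadata, encoding="unicode")
    return (
        "Describe the structure of this molecule in plain English. "
        "Use only the facts encoded in the XML below.\n" + xml_text
    )

# Hand-written stand-in for parser output on the name "2-chloroethan-1-ol".
components = [
    ("substituent", "chloro", "2"),
    ("parent", "ethan", None),
    ("suffix", "ol", "1"),
]
metadata = build_metadata("2-chloroethan-1-ol", components)
prompt = build_prompt(metadata)
print(prompt)
```

The point of the sketch is the division of labor: the rule-based step fixes the structural facts, and the LLM is asked only to verbalize them, which is how a rule-regularized setup constrains the generated description.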

Problem

Research questions and friction points this paper is trying to address.

molecular structure

natural language description

large-scale dataset

structure-language alignment

IUPAC nomenclature

Innovation

Methods, ideas, or system contributions that make the work stand out.

rule-regularized annotation

molecular structure-language alignment

IUPAC parsing