🤖 AI Summary
Large language models (LLMs) face two critical bottlenecks in chemical intelligence: the scarcity of high-quality, chemistry-specific instruction-response data, and the inadequacy of existing synthetic data generation methods at capturing chemistry's hierarchical, rule-governed knowledge structure. Method: This paper introduces ChemOrch, a two-stage controllable data synthesis framework. Stage I generates domain-informed instructions from chemistry rules and task templates; Stage II produces accurate, diverse, tool-aware responses via integrated tool planning, response distillation, and self-repair mechanisms. Contribution/Results: Compared with generic synthetic approaches, ChemOrch significantly improves LLM performance on complex chemistry tasks, including molecular property prediction and reaction pathway reasoning, enhancing both accuracy and generalization. It also systematically exposes latent flaws in chemical reasoning, establishing a scalable, high-fidelity data infrastructure for training domain-specialized LLMs.
📝 Abstract
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and difficulty for the generated tasks, and ensures response precision through tool planning, distillation, and tool-based self-repair. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
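To make the two-stage pipeline concrete, the sketch below mocks up ChemOrch's flow in plain Python. This is purely illustrative: the template texts, the tool names (`rdkit_descriptors`, `reaction_db_lookup`), and all function signatures are assumptions invented for this sketch, not the paper's actual API. Stage I instantiates a task template at a chosen difficulty; Stage II plans which tools the response needs, calls them, and loops through a verify-and-repair cycle before accepting the result.

```python
# Hypothetical sketch of ChemOrch's two-stage synthesis pipeline.
# Template strings, tool names, and signatures are illustrative assumptions.

TASK_TEMPLATES = {
    "property_prediction": "Predict the {prop} of the molecule {smiles}.",
    "reaction_reasoning": "Propose a plausible pathway starting from {smiles}.",
}

def generate_instruction(task, difficulty, fields):
    """Stage I: task-controlled instruction generation from a template.

    `difficulty` would steer e.g. molecule complexity; here it only tags
    the output so downstream sampling can balance difficulty levels.
    """
    text = TASK_TEMPLATES[task].format(**fields)
    return {"task": task, "difficulty": difficulty, "instruction": text}

def plan_tools(task):
    """Tool planning: map a task type to the (hypothetical) tools needed."""
    return {
        "property_prediction": ["rdkit_descriptors"],
        "reaction_reasoning": ["reaction_db_lookup"],
    }[task]

def generate_response(instruction, run_tool, verify, max_repairs=2):
    """Stage II: tool-aware response construction with a self-repair loop.

    `run_tool` executes one named tool on the instruction; `verify` checks
    the assembled response against chemical constraints. On failure the
    tools are re-invoked, standing in for the paper's self-repair step.
    """
    tools = plan_tools(instruction["task"])
    response = {t: run_tool(t, instruction) for t in tools}
    for _ in range(max_repairs):
        if verify(instruction, response):
            break
        response = {t: run_tool(t, instruction) for t in tools}  # repair retry
    return response
```

In a real system `run_tool` would wrap actual chemistry toolkits and `verify` would encode rule-based checks; here they are injected as callables so the control flow stays testable in isolation.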