ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions

📅 2025-09-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face two critical bottlenecks in chemical intelligence: the scarcity of high-quality, chemistry-specific instruction-response data, and the inadequacy of existing synthetic data generation methods at capturing chemistry's hierarchical, rule-governed knowledge structure. Method: This paper introduces ChemOrch, a two-stage controllable data-synthesis framework. Stage I generates domain-informed instructions from chemistry rules and task templates; Stage II produces accurate, diverse, tool-aware responses via integrated tool planning, response distillation, and self-repair mechanisms. Contribution/Results: Compared with generic synthetic approaches, ChemOrch significantly improves LLM performance on complex chemistry tasks, including molecular property prediction and reaction-pathway reasoning, enhancing both accuracy and generalization. It also systematically exposes latent flaws in chemical reasoning, establishing a scalable, high-fidelity data infrastructure for training domain-specialized LLMs.

📝 Abstract
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning, distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality chemical instruction datasets for LLMs
Aligning synthetic data generation with hierarchical chemical information structure
Enhancing LLM chemical intelligence through controllable task generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage synthetic instruction generation process
Tool-aware response construction with planning mechanisms
Controllable task diversity and difficulty levels
Yue Huang
Department of Computer Science and Engineering, University of Notre Dame
Zhengzhe Jiang
Xiaonan Luo
Department of Computer Science and Engineering, University of Notre Dame
Kehan Guo
University of Notre Dame
LLM, Machine Reasoning, Generative Models, XAI, AI for Science
Haomin Zhuang
University of Notre Dame
Yujun Zhou
University of Notre Dame
Trustworthy LLM, LLM Reasoning, Adversarial Machine Learning
Zhengqing Yuan
PhD student, University of Notre Dame
NLP, Deep Learning, CV
Xiaoqi Sun
MIT
Jules Schleinitz
Caltech
Yanbo Wang
MBZUAI
Shuhao Zhang
CMU
Mihir Surve
Department of Chemistry & Biochemistry, University of Notre Dame
Nitesh V Chawla
Department of Computer Science and Engineering, University of Notre Dame
Olaf Wiest
University of Notre Dame
Reaction Mechanisms, Computational Medicinal and Organic Chemistry
Xiangliang Zhang
Leonard C. Bettex Collegiate Professor, Computer Science and Engineering, University of Notre Dame
Machine Learning, AI for Science