🤖 AI Summary
Problem: Large language models (LLMs) exhibit shallow domain understanding, weak reasoning capabilities, and poor interpretability in chemistry. Method: We propose an atomic chemical knowledge representation and mix-sourced distillation framework: (1) constructing a fine-grained, structured dataset of atomized chemical knowledge points; and (2) integrating expert rule injection, mix-sourced knowledge distillation (from general corpora and domain-specific texts), and chemistry-aware reinforcement learning to guide the generation of traceable, logically coherent reasoning chains. Contribution/Results: Our approach significantly improves accuracy and transparency in chemical reaction prediction and molecular property inference, achieving state-of-the-art performance across multiple chemical benchmarks. Crucially, the generated reasoning processes support human-in-the-loop verification, ensuring scientific rigor while maintaining practical applicability.
📝 Abstract
While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. We then propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves state-of-the-art performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.