🤖 AI Summary
Quantitative chemical reasoning in large language models (LLMs), particularly multi-step numerical computation, lacks systematic evaluation, hindering progress in scientific AI.
Method: We introduce QCBench, the first hierarchical benchmark for quantitative chemistry calculations, comprising 350 authentic, reasoning-intensive computational problems spanning seven subdomains and three difficulty levels. It enables fine-grained diagnostic analysis of computational errors and quantifies the gap between linguistic fluency and scientific accuracy.
Contribution/Results: Evaluating 19 state-of-the-art LLMs reveals a sharp performance decline with increasing computational complexity, exposing fundamental limitations in higher-order quantitative reasoning. QCBench establishes a new standard for domain-specific LLM evaluation and optimization in chemistry, providing both a rigorous assessment framework and an empirical foundation for advancing scientific reasoning capabilities.
📝 Abstract
Quantitative chemistry plays a fundamental role in chemistry research, enabling precise predictions of molecular properties, reaction outcomes, and material behaviors. While large language models (LLMs) have shown promise in chemistry-related tasks, their ability to perform rigorous, step-by-step quantitative reasoning remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry benchmark comprising 350 computational chemistry problems across 7 chemistry subfields (analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry), categorized into three hierarchical tiers (basic, intermediate, and expert) to systematically evaluate the mathematical reasoning abilities of LLMs. Designed to minimize shortcuts and emphasize stepwise numerical reasoning, each problem focuses on pure calculations grounded in real-world chemistry subdomains. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 19 LLMs demonstrate consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy.
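To make the tier-wise evaluation described above concrete, here is a minimal sketch of how a QCBench-style problem record and per-tier accuracy computation could look. Everything in it is an illustrative assumption: the `Problem` fields, the 1% relative-error tolerance, and the toy items are hypothetical, not the paper's actual schema, grading rule, or data.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical layout for one QCBench-style problem; the real dataset
# schema is not specified in the abstract.
@dataclass
class Problem:
    subfield: str   # e.g. "physical chemistry"
    tier: str       # "basic" | "intermediate" | "expert"
    question: str
    answer: float   # ground-truth numeric result

def numerically_correct(predicted: float, truth: float, rel_tol: float = 1e-2) -> bool:
    """Grade a prediction by relative error; the 1% tolerance is an assumption."""
    if truth == 0.0:
        return abs(predicted) < rel_tol
    return abs(predicted - truth) / abs(truth) < rel_tol

def accuracy_by_tier(problems, predictions):
    """Aggregate correctness per difficulty tier.

    `predictions` maps a problem's question text to the model's numeric
    answer (already parsed from its response); parsing is out of scope here.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for p in problems:
        total[p.tier] += 1
        pred = predictions.get(p.question)
        if pred is not None and numerically_correct(pred, p.answer):
            correct[p.tier] += 1
    return {tier: correct[tier] / total[tier] for tier in total}

# Toy example (not actual QCBench items):
problems = [
    Problem("general chemistry", "basic",
            "How many moles are in 18.0 g of water (M = 18.015 g/mol)?", 0.999),
    Problem("physical chemistry", "expert",
            "Compute the standard cell potential for ...", 1.10),
]
predictions = {problems[0].question: 1.00, problems[1].question: 0.95}
print(accuracy_by_tier(problems, predictions))  # {'basic': 1.0, 'expert': 0.0}
```

A relative-error criterion like this is one common way to grade numeric answers while tolerating rounding in the final step; stricter or unit-aware matching would change the reported accuracies.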