QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Systematic evaluation of large language models (LLMs) on quantitative chemical reasoning, particularly multi-step numerical computation, is lacking, hindering progress in scientific AI. Method: We introduce QCBench, the first hierarchical benchmark for quantitative chemical calculation, comprising 350 authentic, reasoning-intensive computational problems spanning seven subdomains and three difficulty levels. It enables fine-grained diagnostic analysis of computational errors and quantifies the gap between linguistic fluency and scientific accuracy. Contribution/Results: Evaluating 19 state-of-the-art LLMs reveals a sharp performance decline with increasing computational complexity, exposing fundamental limitations in higher-order quantitative reasoning. QCBench establishes a new standard for domain-specific LLM evaluation and optimization in chemistry, providing both a rigorous assessment framework and an empirical foundation for advancing scientific reasoning capabilities.

📝 Abstract
Quantitative chemistry plays a fundamental role in chemistry research, enabling precise predictions of molecular properties, reaction outcomes, and material behaviors. While large language models (LLMs) have shown promise in chemistry-related tasks, their ability to perform rigorous, step-by-step quantitative reasoning remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry benchmark comprising 350 computational chemistry problems across 7 chemistry subfields (analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry), categorized into three hierarchical tiers (basic, intermediate, and expert) to systematically evaluate the mathematical reasoning abilities of LLMs. Designed to minimize shortcuts and emphasize stepwise numerical reasoning, each problem focuses on pure calculations rooted in real-world chemistry subdomains. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 19 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy.
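To make the benchmark design concrete, the sketch below shows one way a QCBench-style item and a numeric grader could be represented. The field names, the relative-tolerance rule, and the example problem are assumptions for illustration only; the paper does not publish this schema or scoring code.

```python
from dataclasses import dataclass

@dataclass
class QCItem:
    subfield: str   # one of the 7 subfields, e.g. "general chemistry"
    tier: str       # "basic" | "intermediate" | "expert"
    question: str   # calculation problem statement
    answer: float   # ground-truth numeric result
    unit: str       # expected unit of the answer

def is_correct(predicted: float, item: QCItem, rel_tol: float = 1e-2) -> bool:
    """Count a prediction as correct if it falls within a relative
    tolerance of the reference value (tolerance chosen for illustration)."""
    if item.answer == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - item.answer) / abs(item.answer) <= rel_tol

# Hypothetical basic-tier item: ideal-gas pressure, P = nRT / V.
item = QCItem(
    subfield="general chemistry",
    tier="basic",
    question="What pressure (Pa) does 1.00 mol of an ideal gas exert at 298 K in a 24.8 L vessel?",
    answer=9.99e4,
    unit="Pa",
)
# P = 1.00 * 8.314 * 298 / 0.0248 ≈ 9.99e4 Pa
print(is_correct(1.00 * 8.314 * 298 / 0.0248, item))  # True
```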
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on quantitative chemistry reasoning tasks
Assessing step-by-step mathematical abilities in chemistry subfields
Identifying performance gaps in scientific computation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

QCBench benchmark for quantitative chemistry evaluation
Hierarchical problem tiers (basic, intermediate, expert) to assess mathematical reasoning; per-tier scoring is sketched after this list
Groundwork for future improvements such as domain-adaptive fine-tuning and multi-modal integration
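The tiered design is what exposes the reported degradation with task complexity: accuracy is aggregated per tier and compared across models. A minimal sketch of that aggregation, assuming (tier, is_correct) records produced by a grader like the one above, is shown below; the result values are made up purely to illustrate the expected pattern.

```python
from collections import defaultdict

def accuracy_by_tier(records):
    """Aggregate accuracy per difficulty tier from (tier, is_correct) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, ok in records:
        totals[tier] += 1
        correct[tier] += int(ok)
    return {tier: correct[tier] / totals[tier] for tier in totals}

# Made-up results for one model, only to show the degradation pattern.
records = [
    ("basic", True), ("basic", True), ("basic", False),
    ("intermediate", True), ("intermediate", False),
    ("expert", False), ("expert", False),
]
print(accuracy_by_tier(records))
# {'basic': 0.666..., 'intermediate': 0.5, 'expert': 0.0}
```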