ChemPro: A Progressive Chemistry Benchmark for Large Language Models

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first chemistry evaluation framework structured along a progression of cognitive difficulty, introducing a high-quality benchmark of 4,100 natural language question-answer pairs spanning four major branches of chemistry and mirroring the learning trajectory from foundational through high-school level. Employing a multidimensional difficulty grading strategy, the study systematically evaluates 52 prominent large language models on capabilities including factual recall, integration of multiple concepts, long-range reasoning, and problem solving. The findings reveal that while models perform well on basic questions, their accuracy drops substantially on higher-order reasoning tasks, exposing critical limitations in scientific comprehension and complex inference.

📝 Abstract
We introduce ChemPro, a progressive benchmark of 4,100 natural language question-answer pairs in chemistry, organized into four coherent difficulty sections and designed to assess the proficiency of Large Language Models (LLMs) across a broad spectrum of general chemistry topics. The benchmark includes Multiple Choice Questions and Numerical Questions in a balanced ratio, spanning fine-grained information recall, long-horizon reasoning, multi-concept questions, problem solving with nuanced articulation, and straightforward questions, and covers Biochemistry, Inorganic Chemistry, Organic Chemistry, and Physical Chemistry. ChemPro is carefully designed to be analogous to a student's academic evaluation from basic to high-school chemistry: a gradual increase in question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 (52 in total) state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines across different question types and levels of complexity. These findings highlight critical limitations of LLMs in general scientific reasoning and understanding, point towards understudied dimensions of difficulty, and emphasize the need for more robust methodologies to improve LLMs.
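The progressive evaluation described above can be illustrated as a per-section accuracy aggregation. This is a minimal sketch under stated assumptions, not the paper's actual harness: the section names and the `(section, is_correct)` result format are hypothetical, invented here for illustration.

```python
# Hypothetical sketch of per-difficulty-section accuracy aggregation,
# in the spirit of a progressive benchmark. Section names are illustrative.
from collections import defaultdict

def accuracy_by_section(results):
    """results: iterable of (section, is_correct) pairs for one model.

    Returns {section: accuracy}, where accuracy is the fraction of
    correctly answered questions in that difficulty section.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for section, is_correct in results:
        total[section] += 1
        correct[section] += int(is_correct)
    return {s: correct[s] / total[s] for s in total}

# Toy example: one model's graded answers across three sections.
results = [
    ("recall", True), ("recall", True), ("recall", False),
    ("multi-concept", True), ("multi-concept", False),
    ("long-horizon", False),
]
print(accuracy_by_section(results))
```

Comparing such per-section accuracies across models is one simple way to surface the pattern the abstract reports: strong performance on recall-style questions with a drop-off on higher-order reasoning sections.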
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Chemistry Benchmark
Scientific Reasoning
Question Difficulty
Model Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive benchmark
chemistry reasoning
large language models
multi-concept questions
scientific evaluation