ChemPro: A Progressive Chemistry Benchmark for Large Language Models

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first chemistry evaluation framework structured along a progression of cognitive difficulty, introducing a high-quality benchmark of 4,100 natural language question-answer pairs spanning four major branches of chemistry and mirroring the learning trajectory from foundational through high-school level. Employing a multidimensional difficulty grading strategy, the study systematically evaluates 52 prominent large language models on capabilities including factual recall, integration of multiple concepts, long-range reasoning, and problem solving. The findings reveal that while models perform well on basic questions, their accuracy drops substantially on higher-order reasoning tasks, exposing critical limitations in scientific comprehension and complex inference.

📝 Abstract
We introduce ChemPro, a progressive benchmark of 4,100 natural language question-answer pairs in chemistry, organized into four coherent difficulty sections and designed to assess the proficiency of Large Language Models (LLMs) across a broad spectrum of general chemistry topics. The benchmark includes Multiple Choice Questions and Numerical Questions in a balanced ratio, spanning fine-grained information recall, long-horizon reasoning, multi-concept questions, problem solving with nuanced articulation, and straightforward questions, and covers Biochemistry, Inorganic Chemistry, Organic Chemistry, and Physical Chemistry. ChemPro is carefully designed to be analogous to a student's academic evaluation from basic to high-school chemistry: a gradual increase in question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 (52 in total) state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines across different question types and levels of complexity. These findings highlight critical limitations of LLMs in general scientific reasoning and understanding, point towards understudied dimensions of difficulty, and emphasize the need for more robust methodologies to improve LLMs.
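The progressive evaluation described above can be illustrated as a per-section accuracy aggregation. This is a minimal sketch under stated assumptions, not the paper's actual harness: the section names and the `(section, is_correct)` result format are hypothetical, invented here for illustration.

```python
# Hypothetical sketch of per-difficulty-section accuracy aggregation,
# in the spirit of a progressive benchmark. Section names are illustrative.
from collections import defaultdict

def accuracy_by_section(results):
    """results: iterable of (section, is_correct) pairs for one model.

    Returns {section: accuracy}, where accuracy is the fraction of
    correctly answered questions in that difficulty section.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for section, is_correct in results:
        total[section] += 1
        correct[section] += int(is_correct)
    return {s: correct[s] / total[s] for s in total}

# Toy example: one model's graded answers across three sections.
results = [
    ("recall", True), ("recall", True), ("recall", False),
    ("multi-concept", True), ("multi-concept", False),
    ("long-horizon", False),
]
print(accuracy_by_section(results))
```

Comparing such per-section accuracies across models is one simple way to surface the pattern the abstract reports: strong performance on recall-style questions with a drop-off on higher-order reasoning sections.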
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Chemistry Benchmark
Scientific Reasoning
Question Difficulty
Model Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive benchmark
chemistry reasoning
large language models
multi-concept questions
scientific evaluation