🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) undergraduate-level mathematical reasoning due to narrow topic coverage, lack of problem variants, and oversimplified evaluation metrics. To address this, we propose UGMathBench—the first dynamic, undergraduate-focused mathematical reasoning benchmark. It spans 16 subjects, 111 topics, and 5,062 problems, each with three randomly generated variants, supporting multiple answer formats and contamination-resistant evaluation. Methodologically, we introduce two novel metrics: Effective Accuracy (EAcc), the percentage of problems solved correctly across all variants, and Reasoning Gap (Δ), which quantifies robustness as the difference between average per-version accuracy and EAcc; together they jointly evaluate reliability and stability. The benchmark is extensible, enabling continuous updates as models advance. Built upon structured annotation, multi-variant stochastic generation, and a standardized evaluation framework, we empirically assess 23 mainstream LLMs. Results show OpenAI-o1-mini achieves the highest EAcc (56.3%), yet all models exhibit substantial Δ, revealing persistent instability in mathematical reasoning.
📝 Abstract
Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or potentially suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated on UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3% by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation code, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.
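The two metrics follow directly from their definitions in the abstract: EAcc counts a problem as solved only if all of its randomized versions are answered correctly, and $\Delta$ is the average per-version accuracy minus EAcc. A minimal sketch (the function names and data layout are illustrative, not from the paper's codebase):

```python
# Illustrative sketch of EAcc and the reasoning gap (Δ).
# results[i][v] is True iff the model solved version v of problem i;
# UGMathBench uses three randomized versions per problem.

def effective_accuracy(results):
    """Fraction of problems solved correctly in ALL versions."""
    return sum(all(versions) for versions in results) / len(results)

def average_accuracy(results):
    """Mean accuracy over all (problem, version) pairs."""
    total = sum(len(versions) for versions in results)
    return sum(sum(versions) for versions in results) / total

def reasoning_gap(results):
    """Δ = average accuracy across versions − effective accuracy."""
    return average_accuracy(results) - effective_accuracy(results)

# Toy example: 4 problems, 3 randomized versions each.
results = [
    [True, True, True],     # solved consistently
    [True, False, True],    # inconsistent across versions
    [False, False, False],  # unsolved
    [True, True, False],    # inconsistent across versions
]
# EAcc = 1/4 = 0.25; average accuracy = 7/12; Δ = 7/12 − 1/4 = 1/3
```

A model that reasons robustly solves a problem regardless of which randomized values it sees, so its average accuracy equals its EAcc and $\Delta = 0$; inconsistent models inflate average accuracy relative to EAcc, producing a large gap.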