UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) undergraduate-level mathematical reasoning due to narrow topic coverage, lack of problem variants, and oversimplified evaluation metrics. To address this, we propose UGMathBench—the first dynamic, undergraduate-focused mathematical reasoning benchmark. It spans 16 subjects, 111 topics, and 5,062 problems, each with three randomly generated variants, supporting multiple answer formats and contamination-resistant evaluation. Methodologically, we introduce two novel metrics: Effective Accuracy (EAcc), which measures solution consistency across variants, and Reasoning Gap (Δ), which quantifies robustness; together they evaluate reliability and stability. The benchmark is extensible, enabling continuous updates as models advance. Built upon structured annotation, multi-variant stochastic generation, and a standardized evaluation framework, we empirically assess 23 mainstream LLMs. Results show that OpenAI-o1-mini achieves the highest EAcc (56.3%), yet all models exhibit substantial Δ, revealing persistent instability in mathematical reasoning.

📝 Abstract
Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or possibly suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated on UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of problems solved correctly across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3%, by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation code, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.
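The two metrics follow directly from the abstract's definitions: EAcc counts only problems solved in all three randomized versions, while the reasoning gap Δ is the average per-version accuracy minus EAcc. A minimal sketch of both, assuming a boolean results matrix (the function names and data layout are illustrative, not taken from the paper's released evaluation code):

```python
# results[i][v] is True iff the model solved version v of problem i.
# EAcc and the reasoning gap (Delta) as defined in the UGMathBench abstract.

def effective_accuracy(results):
    """EAcc: fraction of problems solved correctly in ALL versions."""
    return sum(all(row) for row in results) / len(results)

def reasoning_gap(results):
    """Delta: average accuracy across all versions minus EAcc."""
    n_versions = len(results[0])
    avg_acc = sum(sum(row) for row in results) / (len(results) * n_versions)
    return avg_acc - effective_accuracy(results)

# Toy example: 4 problems, 3 randomized versions each.
results = [
    [True, True, True],     # solved in every version
    [True, False, True],    # inconsistent across versions
    [False, False, False],  # failed in every version
    [True, True, False],    # inconsistent across versions
]
print(effective_accuracy(results))  # 0.25 (only the first problem counts)
print(reasoning_gap(results))       # 7/12 - 1/4 ≈ 0.333
```

A perfectly robust model would have Δ = 0: its average accuracy would equal its EAcc because it either solves a problem in every variant or in none.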
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
University-Level Math Problems
Assessment Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

UGMathBench
EAcc
Δ
Xin Xu
Department of Mathematics, The Hong Kong University of Science and Technology
Jiaxin Zhang
Department of Mathematics, The Hong Kong University of Science and Technology
Tianhao Chen
PhD student, Zhejiang University (Geotechnical Engineering)
Zitong Chao
Department of Mathematics, The Hong Kong University of Science and Technology
Jishan Hu
Department of Mathematics, The Hong Kong University of Science and Technology
Can Yang
Hong Kong University of Science and Technology
Statistical Machine Learning; Statistical Genetics and Genomics