An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

📅 2025-08-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit significant robustness deficiencies in advanced mathematical reasoning, yet existing benchmarks largely ignore the challenge of “mathematical equivalence under heterogeneous surface representations.” Method: We propose a robustness evaluation framework grounded in mathematically equivalent transformations, introducing PutnamGAP, a competition-level benchmark comprising three categories of non-mathematical perturbation: linguistic rephrasing, parameter perturbation, and variation of core reasoning steps. Contribution/Results: Systematic evaluation across 18 state-of-the-art models reveals substantial performance degradation even when semantic meaning and ground-truth answers remain unchanged: OpenAI's o3 suffers a 4 to 10.5 percentage-point accuracy drop depending on the variant type, with smaller models exhibiting even steeper declines. This work quantifies LLMs' pronounced sensitivity to superficial problem formulations in mathematical reasoning, establishing a new paradigm and benchmark for robust-reasoning evaluation.
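
To make the transformation categories concrete, below is a minimal Python sketch of how answer-preserving variants of a parameterized problem family could be generated. The `Problem` dataclass, the toy sum-of-integers family, and all function names are illustrative assumptions, not the paper's actual generation pipeline:

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Problem:
    statement: str    # natural-language problem text
    answer: Fraction  # ground-truth answer, recomputable in closed form

# Toy family standing in for a competition problem: "Compute 1 + 2 + ... + n."
# The closed form n(n+1)/2 lets every variant keep a verifiable answer.
def make_sum_problem(n: int, wording: str = "Compute the sum") -> Problem:
    return Problem(
        statement=f"{wording} 1 + 2 + ... + {n}.",
        answer=Fraction(n * (n + 1), 2),
    )

def surface_variant(n: int) -> Problem:
    # Linguistic rephrasing: same parameters and answer, new surface wording.
    return make_sum_problem(n, wording="Evaluate the finite series")

def parameter_variant(n: int, delta: int) -> Problem:
    # Parameter perturbation: shift the parameter and recompute the answer,
    # so the variant stays mathematically well-posed and auto-gradable.
    return make_sum_problem(n + delta)

if __name__ == "__main__":
    for variant in (surface_variant(100), parameter_variant(100, delta=7)):
        print(variant.statement, "->", variant.answer)
```

Core-step variants, which restate a problem so that a different key reasoning step is required, are harder to automate in this way and are omitted from the sketch.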

📝 Abstract
In this paper, we introduce a systematic framework, going beyond conventional methods, to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary in linguistic form and parameters. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, o3, scores 49% on the originals but drops by 4 percentage points on surface variants and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights toward further improving their mathematical reasoning capabilities.
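
A robustness gap of the kind reported above (an accuracy drop measured in percentage points relative to the originals) can be computed in a few lines. The per-category layout of the `results` dictionary below is an assumed format with made-up correctness records, not the paper's released evaluation code:

```python
def accuracy(outcomes: list[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(outcomes) / len(outcomes)

def robustness_gap_pp(original: list[bool], variant: list[bool]) -> float:
    """Accuracy drop from originals to variants, in percentage points."""
    return 100.0 * (accuracy(original) - accuracy(variant))

# Hypothetical per-problem correctness records for one model.
results = {
    "original":  [True, True, False, True, False, True],
    "surface":   [True, False, False, True, False, True],
    "core_step": [True, False, False, False, False, True],
}

for kind in ("surface", "core_step"):
    gap = robustness_gap_pp(results["original"], results[kind])
    print(f"{kind}: {gap:.1f} pp drop")
```

Under this metric, the paper's reported gaps for o3 are 4 pp (surface variants) and 10.5 pp (core-step variants) off a 49% baseline.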
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' robustness in mathematical reasoning
Measuring sensitivity to non-mathematical perturbations in math problems
Evaluating performance degradation on mathematically-equivalent problem variants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic framework for LLM robustness assessment
Mathematically-equivalent problem transformations for testing
New benchmark dataset PutnamGAP for evaluation