VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether large language models' (LLMs) mathematical reasoning capabilities reflect genuine generalization or merely overfit to benchmark-specific patterns. Method: We propose VAR-MATH, a symbolic multi-instance evaluation framework that abstracts numeric problems from AMC23 and AIME24 into semantically invariant symbolic templates, generating numerically diverse yet logically equivalent problem variants to construct contamination-resistant test sets, VAR-AMC23 and VAR-AIME24. This framework enables a robust assessment of reasoning consistency, mitigating data leakage and evaluation fragility. Contribution/Results: Experiments reveal that reinforcement learning–fine-tuned models suffer mean performance drops of 48.0% on VAR-AMC23 and 58.3% on VAR-AIME24, indicating heavy reliance on superficial heuristics rather than structured reasoning. VAR-MATH establishes a scalable, overfitting-resistant benchmark for rigorously evaluating LLMs' authentic mathematical reasoning capacity.

📝 Abstract
Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, benchmark contamination arises from the public availability of test problems, increasing the risk of data leakage. Second, evaluation fragility stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce VAR-MATH, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0% on AMC23 and 58.3% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
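The evaluation scheme described in the abstract (abstract a fixed numeric problem into a symbolic template, sample several instantiations, and credit a model only if it answers all of them correctly) can be sketched as follows. The function names, the `{x}`-style templating, and the all-or-nothing scoring rule are illustrative assumptions, not the paper's actual implementation:

```python
import random

def instantiate(template, var_ranges, k=3, seed=0):
    """Generate k numeric variants of a symbolic problem template.

    `template` uses {name}-style placeholders; `var_ranges` maps each
    placeholder to its admissible integer values. VAR-MATH's real
    templates are additionally constrained so every instantiation stays
    semantically invariant (well-defined, same solution structure).
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(k):
        values = {name: rng.choice(list(vals)) for name, vals in var_ranges.items()}
        variants.append((template.format(**values), values))
    return variants

def consistent_score(answered_variants, checker):
    """Score 1 only if ALL instantiations were answered correctly,
    which is what enforces reasoning consistency across variants."""
    return int(all(checker(answer, values) for answer, values in answered_variants))

# Toy AMC-style problem with one symbolic constant.
variants = instantiate(
    "What is the sum of the first {n} positive integers?",
    {"n": range(5, 50)},
    k=3,
)
# A model earns credit only if it solves every sampled variant,
# i.e. returns n*(n+1)//2 for each drawn n.
```

Under this scheme a model that memorized the single published numeric instance of a problem fails unless it can also solve the resampled variants, which is the contamination-resistance argument in miniature.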
Problem

Research questions and friction points this paper is trying to address.

Assess true mathematical reasoning in LLMs beyond benchmark overfitting
Address benchmark contamination from public test data leakage
Mitigate evaluation fragility via multi-instance consistency checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic evaluation framework for reasoning
Converts fixed problems to symbolic templates
Enforces consistent reasoning across variants
Jian Yao
Wuhan University
Computer Vision · AI · 3D · Robotics · SLAM
Ran Cheng
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China
Kay Chen Tan
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China