VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether large language models' (LLMs) mathematical reasoning capabilities reflect genuine generalization or merely overfit to benchmark-specific patterns. Method: We propose VAR-MATH, a symbolic multi-instance evaluation framework that abstracts numeric problems from AMC23 and AIME24 into semantically invariant symbolic templates, generating numerically diverse yet logically equivalent problem variants to construct contamination-resistant test sets, VAR-AMC23 and VAR-AIME24. This framework enables a robust assessment of reasoning consistency, mitigating data leakage and evaluation fragility. Contribution/Results: Experiments reveal that reinforcement learning–fine-tuned models suffer mean performance drops of 48.0% on VAR-AMC23 and 58.3% on VAR-AIME24, indicating heavy reliance on superficial heuristics rather than structured reasoning. VAR-MATH establishes a scalable, overfitting-resistant benchmark for rigorously evaluating LLMs' authentic mathematical reasoning capacity.

📝 Abstract
Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, benchmark contamination arises from the public availability of test problems, increasing the risk of data leakage. Second, evaluation fragility stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce VAR-MATH, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0% on AMC23 and 58.3% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
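The evaluation scheme described in the abstract (abstract a fixed numeric problem into a symbolic template, sample several instantiations, and credit a model only if it answers all of them correctly) can be sketched as follows. The function names, the `{x}`-style templating, and the all-or-nothing scoring rule are illustrative assumptions, not the paper's actual implementation:

```python
import random

def instantiate(template, var_ranges, k=3, seed=0):
    """Generate k numeric variants of a symbolic problem template.

    `template` uses {name}-style placeholders; `var_ranges` maps each
    placeholder to its admissible integer values. VAR-MATH's real
    templates are additionally constrained so every instantiation stays
    semantically invariant (well-defined, same solution structure).
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(k):
        values = {name: rng.choice(list(vals)) for name, vals in var_ranges.items()}
        variants.append((template.format(**values), values))
    return variants

def consistent_score(answered_variants, checker):
    """Score 1 only if ALL instantiations were answered correctly,
    which is what enforces reasoning consistency across variants."""
    return int(all(checker(answer, values) for answer, values in answered_variants))

# Toy AMC-style problem with one symbolic constant.
variants = instantiate(
    "What is the sum of the first {n} positive integers?",
    {"n": range(5, 50)},
    k=3,
)
# A model earns credit only if it solves every sampled variant,
# i.e. returns n*(n+1)//2 for each drawn n.
```

Under this scheme a model that memorized the single published numeric instance of a problem fails unless it can also solve the resampled variants, which is the contamination-resistance argument in miniature.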
Problem

Research questions and friction points this paper is trying to address.

Assess true mathematical reasoning in LLMs beyond benchmark overfitting
Address benchmark contamination from public test data leakage
Mitigate evaluation fragility via multi-instance consistency checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic evaluation framework for reasoning
Converts fixed problems to symbolic templates
Enforces consistent reasoning across variants
Jian Yao
Wuhan University
Computer Vision · AI · 3D · Robotics · SLAM
Ran Cheng
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China
Kay Chen Tan
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China