RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical reasoning evaluation methods often generate ill-posed problems and lack dynamic, semantically well-defined benchmarks with controllable difficulty. Method: We propose RIDE, the first framework to integrate Item Response Theory (IRT) into large language model (LLM) mathematical ability assessment, enabling quantitative difficulty estimation; it further employs reinforcement learning–driven adversarial rewriting to achieve multi-level difficulty evolution while preserving semantic integrity. Contribution/Results: Using simulated response behaviors from 35 models to train an IRT-based difficulty scorer as a reward signal, RIDE constructs an interpretable and scalable dynamic benchmark. On competition-level mathematical benchmarks, RIDE reduces average performance across 26 models by 21.73%, substantially exposing their reasoning limitations and empirically validating the assessment’s sensitivity and effectiveness.

📝 Abstract
Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training-data leakage or superficial pattern matching rather than genuine reasoning. Hence, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede both the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
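To make the IRT step concrete, here is a minimal, hedged sketch of how item difficulty can be estimated from binary correct/incorrect responses of simulated "students" under a Rasch (1PL) model. The paper does not specify its exact IRT variant or fitting procedure; the simulated data, joint maximum-likelihood gradient fit, and all parameter choices below are illustrative assumptions only.

```python
import numpy as np

# Illustrative assumption: 35 simulated respondents (mirroring the 35 LLMs)
# answering 20 items, with correctness governed by a Rasch model:
# P(correct) = sigmoid(ability - difficulty).
rng = np.random.default_rng(0)
n_models, n_items = 35, 20
ability = rng.normal(0, 1, n_models)
difficulty_true = np.linspace(-2, 2, n_items)

logits = ability[:, None] - difficulty_true[None, :]
prob = 1 / (1 + np.exp(-logits))
responses = (rng.random((n_models, n_items)) < prob).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the Rasch log-likelihood.
theta = np.zeros(n_models)  # estimated abilities
beta = np.zeros(n_items)    # estimated difficulties
lr = 0.02
for _ in range(3000):
    p = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
    resid = responses - p            # observed minus predicted correctness
    theta += lr * resid.sum(axis=1)  # ability gradient
    beta -= lr * resid.sum(axis=0)   # difficulty gradient (opposite sign)
    beta -= beta.mean()              # fix the scale: difficulties mean zero

# Items that fewer simulated students solve should get larger beta.
order_corr = np.corrcoef(beta, difficulty_true)[0, 1]
print(f"correlation with true difficulty: {order_corr:.2f}")
```

A scorer fit this way yields a per-question difficulty estimate on a common scale, which is what makes "multi-level difficulty evolution" quantifiable rather than heuristic.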
Problem

Research questions and friction points this paper is trying to address.

Evaluating true mathematical reasoning beyond data leakage
Generating well-posed adversarial questions with controlled difficulty
Measuring robustness degradation in LLMs using perturbed benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Item Response Theory to measure question difficulty
Simulates student responses to build difficulty ranker
Employs reinforcement learning for adversarial question rewriting
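The third bullet can be sketched as a reward shaping scheme: a learned difficulty scorer rates the rewritten question, and the policy is rewarded for moving difficulty toward a target level rather than simply maximizing it (which would encourage ill-posed questions). `score_difficulty` below is a hypothetical stand-in for the paper's IRT-trained ranker, and the target-delta reward shape is an assumption for illustration.

```python
def score_difficulty(question: str) -> float:
    # Hypothetical placeholder for the IRT-trained difficulty ranker;
    # here a crude length proxy is used purely so the sketch runs.
    return len(question.split()) / 10.0

def reward(original: str, rewrite: str, target_delta: float = 1.0) -> float:
    """Reward rewrites whose estimated difficulty rises toward a target
    increment, penalizing both undershoot and overshoot so that
    semantics-preserving, well-posed edits are preferred."""
    delta = score_difficulty(rewrite) - score_difficulty(original)
    return -abs(delta - target_delta)

r = reward("Compute 2+2.",
           "Compute the sum of the first two even positive integers.")
print(f"reward: {r:.2f}")
```

Varying `target_delta` per rollout is one natural way to obtain the multi-level difficulty evolution the summary describes, though the paper's actual reward formulation may differ.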
Xinyuan Li, East China Normal University
Murong Xu, East China Normal University
Wenbiao Tao, East China Normal University
Hanlun Zhu, East China Normal University
Yike Zhao, East China Normal University
Jipeng Zhang, Hong Kong University of Science and Technology
Yunshi Lan, East China Normal University