Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) methods suffer from inefficient exploration during training due to a mismatch between problem difficulty and large language model (LLM) capability: problems that are too hard yield no feasible solution paths, while problems that are too easy provide little learning signal. Method: The authors propose SEELE, a supervision-aided RLVR framework that dynamically adjusts problem difficulty by appending a hint (part of a full solution) to each training problem and adaptively tuning the hint length per instance. Using a multi-round rollout sampling strategy, SEELE fits an item response theory (IRT) model to the accuracy–hint pairs collected in preceding rounds to predict the hint length needed for the next round, keeping each problem's effective difficulty within the high-efficiency learning region as the model's capability evolves. Results: On six mathematical reasoning benchmarks, SEELE outperforms GRPO and supervised fine-tuning (SFT) by +11.8 and +10.5 points on average, respectively, and exceeds the best previous supervision-aided method by +3.6 points.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
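The accuracy–hint curve fitting described in the abstract can be sketched with a simple two-parameter logistic (2PL, IRT-style) model: fit a discrimination and a difficulty parameter to the observed (hint fraction, rollout accuracy) pairs, then invert the curve to find the hint length predicted to hit a target accuracy. This is a minimal illustrative sketch, not the paper's implementation; the function names, the grid-search fit, and the 0.5 default target are assumptions.

```python
import math

def logistic(h, a, b):
    """2PL-style item response curve: predicted rollout accuracy
    as a function of hint fraction h, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (h - b)))

def fit_irt(pairs):
    """Fit (a, b) to observed (hint_fraction, accuracy) pairs by a
    least-squares grid search (an illustrative stand-in for a proper fit)."""
    best, best_err = (1.0, 0.5), float("inf")
    for a in [x / 2 for x in range(1, 41)]:          # discrimination in (0, 20]
        for b in [x / 100 for x in range(0, 101)]:   # difficulty in [0, 1]
            err = sum((logistic(h, a, b) - acc) ** 2 for h, acc in pairs)
            if err < best_err:
                best, best_err = (a, b), err
    return best

def next_hint_fraction(pairs, target_acc=0.5):
    """Invert the fitted curve: the hint fraction predicted to yield target_acc.
    From logistic(h*) = target:  h* = b + ln(target / (1 - target)) / a."""
    a, b = fit_irt(pairs)
    h = b + math.log(target_acc / (1.0 - target_acc)) / a
    return min(1.0, max(0.0, h))  # clamp to a valid hint fraction
```

A higher accuracy target maps to a longer predicted hint, so the inversion step is what lets the framework steer each problem's difficulty toward the high-efficiency region rather than merely measuring it.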
Problem

Research questions and friction points this paper is trying to address.

Optimizing problem difficulty for efficient reinforcement learning
Adapting hint length to match model capability dynamically
Improving exploration efficiency in reasoning tasks via scaffolding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive hint length adjustment per problem
Multi-round rollout sampling for difficulty optimization
Instance-level real-time alignment with model capability
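The multi-round, instance-level adjustment summarized above can be sketched as a per-problem loop: sample rollouts at the current hint length, record the empirical accuracy, and update the hint length toward a target accuracy. For brevity this sketch uses a simple proportional update as a stand-in for the paper's IRT-based prediction, and a toy rollout simulator in place of actual LLM sampling; everything here is an illustrative assumption.

```python
import random

def run_rollouts(difficulty, hint_fraction, n=8):
    """Toy stand-in for sampling n rollouts from the policy.
    A 'problem' is just a difficulty value in [0, 1]; longer hints raise
    the per-rollout solve probability. Returns empirical accuracy."""
    p_solve = max(0.0, min(1.0, 1.0 - difficulty + hint_fraction))
    return sum(random.random() < p_solve for _ in range(n)) / n

def adapt_hint(difficulty, rounds=5, target_acc=0.5, step=0.5):
    """Multi-round adjustment for one problem instance: after each rollout
    round, move the hint fraction toward the length whose empirical accuracy
    matches the target (proportional update, not the paper's IRT fit)."""
    h = 0.0
    history = []
    for _ in range(rounds):
        acc = run_rollouts(difficulty, h)
        history.append((h, acc))
        h = max(0.0, min(1.0, h + step * (target_acc - acc)))
    return h, history
```

Because the loop runs per problem and re-measures accuracy against the current policy each round, the hint length tracks the model's evolving capability rather than a fixed difficulty label.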