Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Strong large language models saturate standard reasoning benchmarks, so group-relative reinforcement learning sees few failure cases: the advantage signal vanishes and the policy collapses. To address this, the work proposes Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy that samples uniformly from a constrained set of high-confidence candidates, and Mixed-CUTS, a training framework that combines these exploratory rollouts with conventional exploitative ones to increase advantage variance within each group. The method preserves solution diversity within the semantic manifold and mitigates policy degeneration. Experiments on Qwen3 models show that Mixed-CUTS improves Pass@1 accuracy on AIME25, used as an out-of-domain generalization benchmark, by up to 15.1% over standard GRPO.
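The vanishing advantage signal can be illustrated with a small sketch. It assumes the commonly used group-relative form, in which each rollout's reward is standardized against the mean and standard deviation of its own group; the function name, the `eps` stabilizer, and the binary rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group's mean and std.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Saturated group: every rollout is correct, so every advantage is zero
# and the group contributes no learning signal.
saturated = [1.0] * 8

# Mixed group: two failures are enough to restore non-zero advantages.
mixed = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]

print(group_relative_advantages(saturated))
print(group_relative_advantages(mixed))
```

In the saturated group every reward equals the group mean, so all advantages are exactly zero; in the mixed group correct rollouts receive positive advantages and incorrect ones negative advantages.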

📝 Abstract

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
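The contrast between standard sampling and uniform sampling over high-confidence candidates can be sketched at the level of a single decoding step. This is a minimal illustration, not the authors' implementation: the abstract describes CUTS as parameter-free and does not specify how the candidate set is constrained, so the explicit `k` below and all function names are hypothetical stand-ins.

```python
import math
import random


def standard_sample(logprobs, rng):
    """Sample a token in proportion to the model's own probabilities."""
    tokens = list(logprobs)
    weights = [math.exp(logprobs[t]) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def uniform_top_k_sample(logprobs, k, rng):
    """Sample uniformly among the k most likely tokens.

    `k` is a hypothetical parameter used only for illustration; the
    source does not say how CUTS chooses its candidate set.
    """
    candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:k]
    return rng.choice(candidates)


# Toy next-token distribution dominated by one token.
logprobs = {
    "a": math.log(0.90),
    "b": math.log(0.08),
    "c": math.log(0.015),
    "d": math.log(0.005),
}

rng = random.Random(0)
standard = [standard_sample(logprobs, rng) for _ in range(1000)]
uniform = [uniform_top_k_sample(logprobs, 2, rng) for _ in range(1000)]

# Standard sampling mostly repeats "a"; uniform top-2 sampling splits
# roughly evenly between "a" and "b" and never emits "c" or "d".
print(standard.count("b"), uniform.count("b"))
```

In a Mixed-CUTS-style setup, some rollouts in each group would use the exploratory sampler and the rest the standard one; the split between the two is not given in the abstract.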

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Mode Collapse

Reasoning Saturation

Advantage Signal

Policy Degeneration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning

Reasoning Diversity

Constrained Uniform Sampling