Dissecting Long Reasoning Models: An Empirical Study

📅 2025-06-05

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This paper systematically addresses three critical challenges in RL training for long-context reasoning models: (1) unclear mechanisms underlying positive and negative sample contributions; (2) low data efficiency in Group Relative Policy Optimization (GRPO); and (3) unstable performance evaluation across models and benchmarks. To tackle these, we propose three innovations: (1) demonstrating that negative samples alone suffice to achieve performance on par with full-sample RL; (2) designing a relative-length reward and offline negative-sample injection strategy to significantly improve data efficiency; and (3) identifying the root causes of evaluation ambiguity and validating that multi-round adaptive evaluation enhances result stability. Experiments across multiple benchmarks show that our methods improve generalization and robustness: zero-advantage sample ratio decreases by 42%, inference efficiency increases by 27%, and multi-round evaluation reduces metric variance by over 60%.

Technology Category

Application Category

📝 Abstract

Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.

Problem

Research questions and friction points this paper is trying to address.

Analyzing roles of positive and negative samples in RL training

Addressing data inefficiency in group relative policy optimization

Investigating unstable performance across reasoning models and benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes positive and negative samples in RL

Improves data efficiency with new strategies

Addresses instability via multiple evaluation runs

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting