AI Summary
This work addresses the challenge in reinforcement learning where sparse binary rewards and weak credit assignment obscure optimization signals, making it difficult to leverage informative content from failed trajectories. To overcome this, the authors propose CIPO, a method operating within the Reinforcement Learning with Verifiable Rewards (RLVR) framework that, without external supervision, extracts corrective examples from the model's own failed trajectories to construct correction-oriented supervision signals. These signals are jointly optimized alongside the original policy objective, explicitly enhancing the model's self-correction and internal reasoning capabilities. Evaluated across 11 benchmarks in mathematical reasoning and code generation, CIPO substantially outperforms strong baselines, consistently improving reasoning accuracy, error correction performance, and pass@K metrics.
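The summary does not spell out how correction samples are built or how the two objectives are weighted, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a binary verifier reward of 0.0 marks a failed rollout, that a correction sample is a new prompt containing the original problem and the failed attempt, and that the joint objective is a weighted sum. The names `Rollout`, `build_correction_prompts`, `joint_loss`, the prompt template, and the `weight` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One on-policy sample (hypothetical container, not from the paper)."""
    prompt: str      # original problem statement
    response: str    # model-generated attempt
    reward: float    # binary verifiable reward: 1.0 if correct, else 0.0


# Hypothetical template; the paper's actual correction prompt is not given.
CORRECTION_TEMPLATE = (
    "{prompt}\n\n"
    "A previous attempt was incorrect:\n{response}\n\n"
    "Identify the error and give a corrected solution."
)


def build_correction_prompts(rollouts: List[Rollout]) -> List[str]:
    """Turn failed rollouts (reward 0.0) into correction-oriented prompts."""
    return [
        CORRECTION_TEMPLATE.format(prompt=r.prompt, response=r.response)
        for r in rollouts
        if r.reward == 0.0
    ]


def joint_loss(rlvr_loss: float, correction_loss: float, weight: float = 1.0) -> float:
    """Assumed form of the joint objective: a simple weighted sum."""
    return rlvr_loss + weight * correction_loss
```

In this sketch, the model would answer each correction prompt, the same verifier would score the corrected answers, and the resulting loss would be combined with the standard RLVR loss through `joint_loss`.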
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
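The abstract does not say how pass@K is computed. A common choice, assumed here, is the unbiased estimator widely used in code-generation evaluation: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct as 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the standard combinatorial estimator; the paper's
    exact evaluation protocol is not stated in the abstract.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a method that only sharpens the distribution around answers the model could already reach tends to raise pass@1 while leaving pass@K at large K roughly flat, which is why gains at larger K are read as evidence of improved underlying reasoning capacity.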