🤖 AI Summary
This work addresses the limitations of existing reinforcement learning approaches in long-chain reasoning tasks, where coarse-grained, sequence-level credit assignment hinders the identification of critical reasoning steps, and standard KL-divergence penalties often lead to gradient instability and overly conservative policies. To overcome these challenges, the paper proposes a novel critic-free reinforcement learning framework that reframes distributional deviation not as a rigid penalty but as a guiding signal, enabling fine-grained, step-level credit assignment. By eliminating conventional KL constraints, the method effectively mitigates gradient instability, promotes policy diversity, and substantially enhances both the identification of pivotal reasoning steps and overall performance on complex reasoning tasks.
📝 Abstract
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization rely on coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long Chain-of-Thought generations. Furthermore, the standard unbounded Kullback-Leibler divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.
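To make the credit-assignment limitation concrete, the following minimal sketch shows how a group-relative, sequence-level advantage is commonly computed and then copied to every token of a response. It illustrates the baseline behavior the abstract criticizes, not the proposed method. The function names, the toy rewards and lengths, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one scalar advantage per sampled response.

    Each reward is normalized against the mean and (population) standard
    deviation of its own group, so no learned critic is needed.
    `eps` is a hypothetical stabilizer for groups with identical rewards.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Sequence-level credit: every token in a response shares one advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Toy group of four sampled responses to the same prompt (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]   # e.g. 1.0 if the final answer is correct
lengths = [3, 5, 2, 4]           # number of generated tokens per response

adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)
```

Because each inner list of `token_adv` holds a single repeated value, a pivotal reasoning step and a filler token in the same response receive identical credit. This is the coarse granularity that step-level credit assignment is meant to address.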