DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing reinforcement learning approaches in long-chain reasoning tasks, where coarse-grained, sequence-level credit assignment hinders the identification of critical reasoning steps, and standard KL-divergence penalties often lead to gradient instability and overly conservative policies. To overcome these challenges, the paper proposes a novel critic-free reinforcement learning framework that reframes distributional deviation not as a rigid penalty but as a guiding signal, enabling fine-grained, step-level credit assignment. By eliminating conventional KL constraints, the method effectively mitigates gradient instability, promotes policy diversity, and substantially enhances both the identification of pivotal reasoning steps and overall performance on complex reasoning tasks.

📝 Abstract

Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization (GRPO) rely on coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long Chain-of-Thought generations. Furthermore, the standard unbounded Kullback-Leibler divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization (DGPO), a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.
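
To make the two baseline limitations named in the abstract concrete, here is a minimal sketch of the GRPO-style setup being criticized. It is not the DGPO method, whose details are not given on this page. The function names, the use of population standard deviation, and the choice of the `exp(d) - d - 1` per-token KL estimator are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's scalar reward against its sampling group.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Sequence-level credit: every token in a response gets the same advantage,
    so a pivotal reasoning step is weighted exactly like a filler token."""
    return [[a] * n for a, n in zip(advantages, lengths)]


def kl_penalty_estimate(logp_policy, logp_ref):
    """Per-token KL estimate exp(d) - d - 1 with d = logp_ref - logp_policy.

    It is non-negative and zero when the two log-probabilities match, but it
    grows exponentially in d, so it has no upper bound.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


# Hypothetical group of four sampled responses with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # token counts per response (illustrative)

advantages = group_relative_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages, lengths)
```

In this sketch every token of a rewarded response carries the same positive advantage regardless of how much it contributed, and the penalty term blows up on tokens where the policy assigns far less probability than the reference model. These are the two behaviors the abstract attributes to the baseline.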

Problem

Research questions and friction points this paper is trying to address.

credit assignment

reinforcement learning

large language models

Chain of Thought

KL divergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution Guided Policy Optimization

fine-grained credit assignment

Chain of Thought