🤖 AI Summary
This work addresses a critical limitation of multi-reward reinforcement learning: the widely used GRPO applies a single normalization to the combined sum of reward components, which collapses distinct reward combinations into identical advantage estimates, degrading the resolution of the training signal and leading to suboptimal convergence or, in some cases, early training failure. To overcome this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which normalizes each reward component independently, preserving their relative differences and yielding more accurate and stable advantage estimation. The paper analyzes this normalization flaw of GRPO in multi-reward settings, and extensive experiments on tool calling, mathematical reasoning, and code reasoning show that GDPO consistently outperforms GRPO on both correctness metrics (accuracy, bug ratio) and constraint-adherence metrics (format, length), demonstrating its effectiveness and generalizability.
📝 Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO's normalization to combined rollout rewards causes distinct reward combinations to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and code reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint-adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning.
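To make the contrast concrete, the sketch below compares the two normalization schemes on a toy rollout group. It is a hypothetical illustration based only on the abstract's description, not the authors' implementation: the reward values, the group size, and the choice to sum the per-component normalized advantages in the GDPO-style variant are all assumptions.

```python
import numpy as np

# Hypothetical toy group of 5 rollouts with two reward components
# (values are illustrative assumptions, not from the paper).
correctness = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # hard objective: 2/5 correct
fmt_reward  = np.array([0.0, 1.0, 1.0, 1.0, 1.0])  # easy objective: 4/5 well-formatted

def group_norm(r, eps=1e-8):
    """Group-relative normalization: (r - mean) / std over the rollout group."""
    return (r - r.mean()) / (r.std() + eps)

# GRPO-style: sum the components first, then normalize the combined reward.
# Rollouts 0, 2, 3, 4 all sum to 1.0, so their distinct reward profiles
# collapse to the same advantage value.
adv_grpo = group_norm(correctness + fmt_reward)

# GDPO-style (one plausible reading of "decoupled normalization"):
# normalize each reward component independently over the group, then combine.
adv_gdpo = group_norm(correctness) + group_norm(fmt_reward)

print("GRPO:", np.round(adv_grpo, 2))  # approx [-0.5,  2.0, -0.5,  -0.5,  -0.5 ]
print("GDPO:", np.round(adv_gdpo, 2))  # approx [-0.78, 1.72, -0.32, -0.32, -0.32]
```

In this toy group, GRPO assigns the same advantage to a correct-but-unformatted rollout and to incorrect-but-formatted rollouts because their reward sums coincide, whereas the decoupled scheme keeps those reward profiles distinguishable.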