🤖 AI Summary
This work addresses the pervasive issue of length inflation in large language models trained with reinforcement learning, where reward-driven optimization often leads to excessively verbose outputs without commensurate gains in downstream performance. To tackle this challenge, the authors propose Group Relative Reward Rescaling (GR³), a framework that introduces the first general, continuous, and reward-dependent length gating mechanism. GR³ dynamically adapts to instance difficulty and preserves high-quality trajectory signals through multiplicative reward rescaling, group-relative regularization, and advantage-aware calibration. Integrated seamlessly into the GRPO training pipeline, the method significantly mitigates length inflation under both RLHF and RLVR paradigms, outperforming state-of-the-art length-regularized baselines while maintaining or even enhancing downstream task performance.
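To make the multiplicative-gating idea concrete, here is a minimal sketch of a GRPO-style advantage computation with a group-relative, reward-dependent length gate. The exact formulas are not given in this summary, so the gate shape, the `alpha` parameter, and the choice of the group mean as the length budget are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def gr3_rescaled_advantages(rewards, lengths, alpha=0.5):
    """Illustrative sketch (not the paper's exact formulas) of a
    multiplicative length gate inside a GRPO-style group update.

    rewards : per-trajectory scalar rewards for one prompt group
    lengths : corresponding response lengths in tokens
    alpha   : hypothetical gate strength (assumed hyperparameter)
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative length budget: each response is judged against the
    # group's own mean length, so the budget adapts to instance difficulty.
    budget = lengths.mean()

    # Multiplicative gate in (0, 1]: responses longer than the budget are
    # smoothly down-weighted, avoiding the compensatory shortcut that an
    # additive penalty would create.
    gate = np.exp(-alpha * np.maximum(lengths / budget - 1.0, 0.0))
    rescaled = rewards * gate

    # Standard group-relative (GRPO) advantage on the rescaled rewards.
    return (rescaled - rescaled.mean()) / (rescaled.std() + 1e-8)
```

Because the gate multiplies the reward rather than subtracting from it, a zero-reward trajectory stays at zero regardless of length, while equally rewarded trajectories are ranked in favor of the shorter one.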
📝 Abstract
Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.