🤖 AI Summary
This work addresses the pervasive issue of length inflation in large language models trained with reinforcement learning, where reward-driven optimization often leads to excessively verbose outputs without commensurate gains in downstream performance. To tackle this challenge, the authors propose Group Relative Reward Rescaling (GR³), a framework that introduces the first general, continuous, and reward-dependent length gating mechanism. GR³ dynamically adapts to instance difficulty and preserves high-quality trajectory signals through multiplicative reward rescaling, group-relative regularization, and advantage-aware calibration. Integrated seamlessly into the GRPO training pipeline, the method significantly mitigates length inflation under both RLHF and RLVR paradigms, outperforming state-of-the-art length-regularized baselines while maintaining or even enhancing downstream task performance.
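To make the multiplicative-gating idea concrete, here is a minimal sketch of a GRPO-style advantage computation with a group-relative, reward-dependent length gate. The exact formulas are not given in this summary, so the gate shape, the `alpha` parameter, and the choice of the group mean as the length budget are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def gr3_rescaled_advantages(rewards, lengths, alpha=0.5):
    """Illustrative sketch (not the paper's exact formulas) of a
    multiplicative length gate inside a GRPO-style group update.

    rewards : per-trajectory scalar rewards for one prompt group
    lengths : corresponding response lengths in tokens
    alpha   : hypothetical gate strength (assumed hyperparameter)
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative length budget: each response is judged against the
    # group's own mean length, so the budget adapts to instance difficulty.
    budget = lengths.mean()

    # Multiplicative gate in (0, 1]: responses longer than the budget are
    # smoothly down-weighted, avoiding the compensatory shortcut that an
    # additive penalty would create.
    gate = np.exp(-alpha * np.maximum(lengths / budget - 1.0, 0.0))
    rescaled = rewards * gate

    # Standard group-relative (GRPO) advantage on the rescaled rewards.
    return (rescaled - rescaled.mean()) / (rescaled.std() + 1e-8)
```

Because the gate multiplies the reward rather than subtracting from it, a zero-reward trajectory stays at zero regardless of length, while equally rewarded trajectories are ranked in favor of the shorter one.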
📝 Abstract
Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.