Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

📅 2026-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing reinforcement learning with verifiable rewards (RLVR) methods—such as GSPO—where length bias during training leads to uncontrolled or even collapsed response lengths, thereby hindering reasoning performance. The study is the first to uncover the underlying mechanism driving this phenomenon and proposes Length-Unbiased Sequence Policy Optimization (LUSPO), a novel approach that stabilizes generation length by reformulating the loss function to eliminate length bias. Experimental results demonstrate that LUSPO significantly outperforms prior methods, including GRPO and GSPO, across both mathematical and multimodal reasoning benchmarks, establishing a new state-of-the-art optimization strategy in RLVR.

📝 Abstract
Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
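For context on where response length enters the objective: GSPO computes a sequence-level importance ratio by averaging the per-token log-ratios over the response length before exponentiating (a length-normalized geometric mean). The sketch below illustrates that ratio only; it is not LUSPO's corrected loss, which the abstract does not spell out, and the function name is mine.

```python
import math

def gspo_sequence_ratio(logp_new, logp_old):
    """Sequence-level importance ratio as defined by GSPO:
    s(theta) = exp( (1/|y|) * sum_t [log pi_theta(y_t) - log pi_old(y_t)] ).
    The 1/|y| normalization is the point at which response length
    enters the loss; LUSPO (per the abstract) reformulates the loss
    so that it is unbiased with respect to |y|.
    """
    assert len(logp_new) == len(logp_old) and len(logp_new) > 0
    n = len(logp_new)  # response length |y|
    mean_log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / n
    return math.exp(mean_log_ratio)

# Same per-token log-ratio (0.05) at two different lengths gives the
# same sequence ratio, but each token's gradient is scaled by 1/|y|,
# so longer responses receive a weaker per-token update signal.
short = gspo_sequence_ratio([-1.0, -1.0], [-1.05, -1.05])          # |y| = 2
long = gspo_sequence_ratio([-1.0] * 8, [-1.05] * 8)                # |y| = 8
```

The equal ratios at unequal lengths make the length-dependence concrete: the signal per token shrinks as responses grow, which is consistent with the length-collapse behavior the paper attributes to GSPO.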
Problem

Research questions and friction points this paper is trying to address.

response length variation
length bias
reinforcement learning with verifiable rewards
reasoning capability
sequence policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Length-Unbiased Sequence Policy Optimization
Response Length Bias
Reinforcement Learning with Verifiable Rewards
LLM Reasoning