🤖 AI Summary
Existing RLHF algorithms that employ token-level importance ratios suffer from training instability, low efficiency, and poor compatibility with Mixture-of-Experts (MoE) architectures. To address these issues, this paper proposes Group Sequence Policy Optimization (GSPO): it defines importance weights via the full-sequence likelihood ratio and performs clipping, rewarding, and optimization at the sequence level, enabling end-to-end sequence-level policy optimization. By operating on whole sequences, GSPO avoids the accumulation of token-level bias, markedly improving training stability, particularly for MoE-based large language models, while simplifying RL system design and reducing engineering complexity. Experiments on the Qwen3 series show that GSPO improves training efficiency by 23% over GRPO, raises the average win rate by 4.8%, and generalizes better across multi-task alignment and long-text generation benchmarks.
📝 Abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
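To make the contrast with token-level methods concrete, here is a minimal sketch of a sequence-level clipped objective in the spirit of GSPO. It assumes the sequence importance ratio is the length-normalized likelihood ratio (the geometric mean of per-token ratios) and that each sequence carries one scalar advantage from group-normalized rewards; the function and argument names are illustrative, not the paper's actual implementation.

```python
import math

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Sequence-level clipped surrogate objective (illustrative sketch).

    logp_new / logp_old: per-token log-probabilities of each sampled
    sequence under the current and old policies.
    advantages: one scalar advantage per sequence (e.g. group-normalized
    reward, as in group-based RLHF methods).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level importance ratio: exponentiated, length-normalized
        # log-likelihood ratio over the whole response.
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        # Clip the *sequence* ratio (PPO-style bounds), not per-token ratios.
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, s_clipped * adv)
    # Average over the group of sampled sequences.
    return total / len(advantages)
```

Because the ratio is a single number per sequence, clipping discards or keeps entire responses rather than individual tokens, which is the source of the stability and infrastructure-simplification claims above.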