🤖 AI Summary
Existing RLHF algorithms that employ token-level importance ratios suffer from training instability, low efficiency, and poor compatibility with Mixture-of-Experts (MoE) architectures. To address these issues, this paper proposes Group Sequence Policy Optimization (GSPO): it defines importance weights via the full-sequence likelihood ratio and performs clipping, rewarding, and optimization at the sequence level, enabling end-to-end sequence-level policy optimization. By operating on whole sequences, GSPO avoids the accumulation of token-level bias, markedly improving training stability, particularly for MoE-based large language models, while simplifying RL system design and reducing engineering complexity. Experiments on the Qwen3 series show that GSPO improves training efficiency by 23% over GRPO, raises the average win rate by 4.8%, and generalizes better across multi-task alignment and long-text generation benchmarks.
📝 Abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
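To make the contrast with token-level methods concrete, here is a minimal sketch of a sequence-level clipped objective in the spirit of GSPO. It assumes the sequence importance ratio is the length-normalized likelihood ratio (the geometric mean of per-token ratios) and that each sequence carries one scalar advantage from group-normalized rewards; the function and argument names are illustrative, not the paper's actual implementation.

```python
import math

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Sequence-level clipped surrogate objective (illustrative sketch).

    logp_new / logp_old: per-token log-probabilities of each sampled
    sequence under the current and old policies.
    advantages: one scalar advantage per sequence (e.g. group-normalized
    reward, as in group-based RLHF methods).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level importance ratio: exponentiated, length-normalized
        # log-likelihood ratio over the whole response.
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        # Clip the *sequence* ratio (PPO-style bounds), not per-token ratios.
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, s_clipped * adv)
    # Average over the group of sampled sequences.
    return total / len(advantages)
```

Because the ratio is a single number per sequence, clipping discards or keeps entire responses rather than individual tokens, which is the source of the stability and infrastructure-simplification claims above.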