Soft Adaptive Policy Optimization

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning for LLMs suffers from high variance in token-level importance ratios, a problem exacerbated in Mixture-of-Experts (MoE) models, which leads to unstable policy updates. To address this, the paper proposes Soft Adaptive Policy Optimization (SAPO), which replaces the hard clipping used in GSPO and GRPO with a smooth, temperature-controlled gate. The gate forms a continuous trust region that selectively attenuates off-policy tokens while preserving the learning signal from near-on-policy ones, combining sequence-level coherence with token-level adaptivity. On mathematical reasoning benchmarks, SAPO shows improved training stability and higher Pass@1 under comparable training budgets. SAPO is also used to train the Qwen3-VL model series, where it yields consistent gains across diverse tasks and model sizes.

📝 Abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
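The contrast the abstract draws between hard clipping and a smooth gate can be illustrated with a small sketch. The abstract does not state the gate's functional form, so the sigmoid-shaped `soft_gate` below, the temperature `tau`, the clipping width `eps`, and the example ratios are all illustrative assumptions rather than the paper's definitions. The sketch only shows the qualitative behavior described above: a sequence-level hard clip zeroes every token once the sequence leaves the band, while a per-token soft gate shrinks only the off-policy token.

```python
import math

def hard_gate(ratio, advantage, eps=0.2):
    """Hard clipping: the gradient either passes (1.0) or is cut entirely (0.0)."""
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return 1.0

def soft_gate(ratio, tau=5.0):
    """Assumed smooth gate: 1.0 at ratio == 1, decaying continuously away from 1.

    Larger tau narrows the region in which tokens keep most of their weight.
    """
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)

def sequence_ratio(token_ratios):
    """Length-normalized (geometric-mean) ratio for a whole sequence."""
    return math.exp(sum(math.log(r) for r in token_ratios) / len(token_ratios))

# Hypothetical sequence: three near-on-policy tokens and one highly off-policy token.
token_ratios = [1.00, 1.02, 0.98, 2.50]
advantage = 1.0

# Sequence-level hard clipping: one decision shared by every token.
seq_gate = hard_gate(sequence_ratio(token_ratios), advantage)
sequence_level = [seq_gate] * len(token_ratios)

# Token-level soft gating: each token is scaled by its own distance from 1.
token_soft = [soft_gate(r) for r in token_ratios]

for r, g_seq, g_soft in zip(token_ratios, sequence_level, token_soft):
    print(f"ratio={r:.2f}  sequence hard gate={g_seq:.1f}  token soft gate={g_soft:.4f}")
```

In this toy setting the single outlier pushes the sequence ratio outside the band, so the hard sequence-level gate discards all four tokens, whereas the soft gate keeps the three near-on-policy tokens almost at full weight.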

Problem

Research questions and friction points this paper is trying to address.

High variance in token-level importance ratios causes unstable policy updates

Hard clipping methods struggle to balance stability and effective learning

Mixture-of-Experts models exacerbate variance issues in reinforcement learning
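To make the variance problem concrete, the sketch below computes token-level importance ratios from per-token log-probabilities and compares them with a length-normalized sequence-level ratio. The log-probability values are made up for illustration, and the geometric-mean sequence ratio is an assumption about how a sequence-level method such as GSPO aggregates tokens; neither comes from this page.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratio: pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: geometric mean of the token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical log-probabilities for one 4-token response.
# The third token's probability shifts sharply between the two policies.
logp_old = [-1.0, -0.5, -2.0, -0.8]
logp_new = [-1.0, -0.6, -0.5, -0.8]

tok = token_ratios(logp_new, logp_old)
seq = sequence_ratio(logp_new, logp_old)

print("token ratios:  ", [round(r, 3) for r in tok])
print("sequence ratio:", round(seq, 3))
```

A single shifted token produces one very large token-level ratio, while the sequence-level ratio stays moderate. This is the trade-off the items above point at: token-level ratios are noisy, and averaging over the sequence hides which token caused the shift.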

Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Adaptive Policy Optimization replaces hard clipping

Uses temperature-controlled gate for adaptive updates

Maintains sequence coherence and token-adaptive learning

🔎 Similar Papers

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation