Soft Policy Optimization: Online Off-Policy RL for Sequence Models

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RL-based LLM post-training methods (e.g., PPO) rely on on-policy updates, which prevents effective use of off-policy data such as human demonstrations and historical trajectories—resulting in low sample efficiency, constrained exploration, and insufficient policy diversity. Moreover, they require separate value networks, incurring high GPU memory overhead and communication bottlenecks. Method: We propose the first value-model-free framework unifying online and offline soft policy optimization for sequence models, grounded in maximum-entropy RL. Our approach enables end-to-end trajectory optimization without explicit value functions, leveraging importance-weighted trajectory replay and adaptive temperature control. Contribution/Results: The method is both theoretically principled and scalable in practice. On code competition benchmarks, it surpasses PPO in pass@10, accelerates training by 2.3×, reduces GPU memory usage by 68%, and significantly improves policy diversity and training stability.
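The summary above mentions importance-weighted trajectory replay under a maximum-entropy objective. A minimal sketch of that idea follows; this is an illustration of the general soft-RL recipe, not the paper's exact loss, and the function name, clipping scheme, and temperature value are assumptions:

```python
import math

def soft_objective(logp_current, logp_behavior, rewards, tau=0.1, clip=10.0):
    """Importance-weighted maximum-entropy return for one trajectory.

    logp_current:  per-token log-probs under the policy being trained
    logp_behavior: per-token log-probs under whatever produced the data
                   (an earlier checkpoint, a human expert, another policy)
    rewards:       per-token rewards for the trajectory
    tau:           entropy temperature (the "soft" in soft RL)
    """
    # Trajectory-level importance weight corrects for off-policy data;
    # clipping the log-ratio keeps the estimator's variance bounded.
    log_w = sum(logp_current) - sum(logp_behavior)
    w = math.exp(max(-clip, min(clip, log_w)))
    # Soft return: environment reward plus an entropy bonus, so the
    # optimized policy stays stochastic rather than collapsing.
    soft_return = sum(rewards) - tau * sum(logp_current)
    return w * soft_return

# On-policy data (identical log-probs, so weight = 1) reduces to the
# usual entropy-regularized return.
val = soft_objective([-1.0, -2.0], [-1.0, -2.0], [0.0, 1.0], tau=0.1)  # 1.3
```

With off-policy data the weight re-scales each trajectory's contribution, which is what lets a single objective consume both fresh rollouts and replayed or demonstrated sequences.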


📝 Abstract
RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we show that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.
Problem

Research questions and friction points this paper is trying to address.

Low sample efficiency in RL-based language model training.
Inability of on-policy methods to learn from arbitrary online and offline trajectories.
High memory cost of separate value models and loss of policy diversity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Policy Optimization for sequence models
Learns from arbitrary online and offline trajectories
Eliminates need for separate value model
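The "no separate value model" point has a compact explanation in maximum-entropy RL: for a softmax policy over tokens, the soft state value is the log-partition of the (temperature-scaled) logits, so values can be read off the policy itself. The sketch below shows this generic soft-RL identity; the paper's exact parameterization may differ:

```python
import math

def soft_value(logits, tau=1.0):
    """Soft state value V(s) = tau * log sum_a exp(Q(s,a) / tau).

    If the policy's logits are interpreted as soft Q-values, V(s) is a
    (numerically stable) log-sum-exp of those logits -- no extra value
    network is needed to compute it.
    """
    m = max(q / tau for q in logits)
    return tau * (m + math.log(sum(math.exp(q / tau - m) for q in logits)))

def soft_policy(logits, tau=1.0):
    """pi(a|s) = exp((Q(s,a) - V(s)) / tau): a softmax of the same logits."""
    v = soft_value(logits, tau)
    return [math.exp((q - v) / tau) for q in logits]

probs = soft_policy([1.0, 2.0, 3.0])  # sums to 1 by construction
```

Because the value subtracted inside the softmax is exactly the normalizer, the policy and its soft values share one set of parameters, which is what makes a value-model-free formulation possible.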