🤖 AI Summary
Large language models trained with reinforcement learning are prone to distributional shift and training collapse caused by policy staleness, asynchronous training, and train-inference mismatch. This work proposes the first soft policy optimization framework that integrates variational inference with sequence-level importance weighting. By building variance reduction into the variational formulation, the method derives a closed-form reshaping kernel that obviates the need for length normalization, enabling stable off-policy training. It overcomes the limitations of conventional token-level clipping and sequence-level normalization, supporting policy staleness ratios up to 64× and fully asynchronous execution while achieving consistent gains on mathematical reasoning benchmarks across both dense and mixture-of-experts architectures.
📝 Abstract
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64× and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO.
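To make the core idea concrete, here is a minimal sketch of sequence-level importance weighting with a smooth reshaping step. The `soft_reshape` kernel below is a hypothetical tempering function used purely for illustration; the actual closed-form kernel derived by VESPO is not reproduced here.

```python
import numpy as np

def sequence_importance_weight(logp_current, logp_behavior):
    """Sequence-level importance weight for one sampled sequence.

    logp_current / logp_behavior: per-token log-probabilities of the
    sequence under the current policy and the (possibly stale) behavior
    policy. The weight is the product of per-token ratios, computed in
    log space for numerical stability -- note there is no division by
    sequence length (no length normalization).
    """
    return np.exp(np.sum(logp_current) - np.sum(logp_behavior))

def soft_reshape(w, tau=1.0):
    """HYPOTHETICAL smooth reshaping of a raw sequence weight.

    Illustrates the general idea of reshaping extreme importance
    weights toward 1 (variance reduction) instead of hard token-level
    clipping. This tempering form is an assumption, not VESPO's kernel.
    """
    return w ** (1.0 / (1.0 + tau))  # tau=1 compresses w to sqrt(w)

# Toy example: current policy assigns higher probability to the sequence
# than the stale behavior policy, so the raw weight exceeds 1.
logp_cur = np.array([-1.0, -2.0])
logp_beh = np.array([-1.5, -2.5])
w = sequence_importance_weight(logp_cur, logp_beh)   # exp(1.0) ~ 2.718
w_soft = soft_reshape(w)                              # ~ 1.649
```

Operating on the whole-sequence ratio, rather than clipping each token's ratio independently, keeps the correction consistent with the sequence-level objective while the reshaping step keeps the weight's variance bounded under large staleness.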