Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the performance collapse that arises when large language model (LLM) reinforcement learning is trained on stale rollout data. The authors propose an off-policy training paradigm centered on M2PO (Second-Moment Trust Policy Optimization), which extends importance sampling with a second-moment constraint on importance weights to dynamically suppress high-variance token-level updates, balancing training stability and information efficiency. Theoretically and empirically, they identify a "prosperity-before-collapse" phenomenon, revealing a finite usability window for outdated rollouts. Evaluated across six LLMs and eight benchmarks, M2PO enables stable training even with data stale by 256 model updates, matching on-policy performance while significantly reducing gradient variance. The approach addresses a key practical bottleneck hindering off-policy RL adoption in LLM alignment.

📝 Abstract
Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation in off-policy RL with stale data
Proposes method to stabilize training using constrained importance weights
Enables scalable asynchronous RL while matching on-policy performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

M2PO constrains second moment of importance weights
It suppresses extreme outliers while preserving informative updates
Enables stable off-policy training with stale data
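The mechanism described above can be sketched in a few lines. The paper's exact objective is not reproduced in this summary, so the threshold `m2_max` and the greedy drop-the-most-extreme-weight loop below are illustrative assumptions, not M2PO's actual implementation:

```python
import numpy as np

def second_moment_mask(logp_new, logp_old, m2_max=1.5):
    """Sketch of second-moment-constrained importance weighting.

    Computes token-level importance weights r = pi_new / pi_old and
    greedily masks the most extreme weights (farthest from 1) until
    the second moment E[r^2] over the kept tokens falls below a
    hypothetical budget m2_max. Only outliers are dropped; ordinary
    stale-data tokens keep contributing to the update.
    """
    r = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    keep = np.ones(r.shape, dtype=bool)
    # Visit tokens from most to least extreme importance weight.
    for idx in np.argsort(np.abs(r - 1.0))[::-1]:
        if np.mean(r[keep] ** 2) <= m2_max:
            break  # constraint satisfied; stop masking
        keep[idx] = False
    return keep

# Example: one token with a wildly off-policy weight gets masked,
# while the near-on-policy tokens are all preserved.
logp_old = np.zeros(5)
logp_new = np.array([0.0, 0.0, 0.0, 0.0, 3.0])  # last token: r ~ 20
keep = second_moment_mask(logp_new, logp_old)
```

This mirrors the behavior reported in the abstract: under high staleness, only a tiny fraction of high-variance tokens is excluded, while informative updates pass through unclipped.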