🤖 AI Summary
To address three key challenges in reinforcement learning (RL)-enhanced reasoning for large language models (LLMs) (stale off-policy data, the resulting degradation of RL algorithms, and low AI accelerator utilization), this paper proposes an efficient online RL training framework tailored to long-sequence generation. The method introduces: (1) an “in-flight weight update” mechanism that dynamically loads the latest policy weights during sequence generation, substantially improving data freshness; and (2) an asynchronous pipeline-parallel architecture that decouples data generation from model training, integrating real-time weight synchronization with fine-grained GPU scheduling. Experiments on a 128-H100 cluster demonstrate roughly a 2× speedup over conventional RL training while keeping the training data highly on-policy and sustaining >92% GPU utilization, alleviating throughput bottlenecks inherent in RLHF-style training.
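The core of the in-flight update idea is that the generation engine checks for freshly published policy weights *between token steps* of a sequence in progress, rather than only between full rollouts. A minimal sketch of that control flow, with purely illustrative names (`WeightStore`, `generate_sequence` are not PipelineRL's actual API, and weights are stood in for by a version integer):

```python
import threading

class WeightStore:
    """Holds the latest published policy weights (modeled here as a version int)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0

    def publish(self, version):
        """Called by the trainer when a new policy checkpoint is ready."""
        with self._lock:
            self._version = version

    def latest(self):
        with self._lock:
            return self._version

def generate_sequence(store, num_tokens, on_step=None):
    """Generate one token sequence, swapping in newer weights mid-sequence.

    `on_step` stands in for the trainer running concurrently; in a real
    system the publish would happen from another process.
    """
    current = store.latest()
    versions_used = []
    for step in range(num_tokens):
        if on_step:
            on_step(step)              # simulated concurrent trainer activity
        newest = store.latest()
        if newest != current:          # in-flight update: reload without
            current = newest           # discarding the partial sequence
        versions_used.append(current)  # token `step` was sampled under `current`
    return versions_used

store = WeightStore()
# Simulate the trainer publishing new weights while the 3rd token is generated:
out = generate_sequence(store, 4, on_step=lambda s: store.publish(1) if s == 2 else None)
print(out)  # -> [0, 0, 1, 1]: later tokens already use the fresher policy
```

The point of the sketch is the staleness bound: every emitted token is at most one check interval behind the newest published weights, instead of an entire rollout behind.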
📝 Abstract
Reinforcement Learning (RL) is increasingly used to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty of maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by a novel in-flight weight update mechanism. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $\sim 2\times$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.
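The "concurrent asynchronous data generation and model training" described above is, structurally, a producer-consumer pipeline: a generation worker streams rollout batches into a bounded queue while a training worker drains it, so neither side idles waiting for the other. A hedged single-process sketch under that assumption (thread and queue names are illustrative, not PipelineRL's implementation):

```python
import queue
import threading

def generator(work_q, num_batches):
    """Producer: streams rollout batches to the trainer, then a sentinel."""
    for i in range(num_batches):
        work_q.put(f"batch-{i}")   # blocks only when the trainer falls behind
    work_q.put(None)               # sentinel: no more data

def trainer(work_q, results):
    """Consumer: trains on batches as they arrive, concurrently with generation."""
    while True:
        batch = work_q.get()
        if batch is None:
            break
        results.append(f"trained-on-{batch}")  # stand-in for a gradient step

work_q = queue.Queue(maxsize=2)    # small bound = backpressure, limits staleness
results = []
g = threading.Thread(target=generator, args=(work_q, 4))
t = threading.Thread(target=trainer, args=(work_q, results))
g.start(); t.start(); g.join(); t.join()
print(results)
```

The bounded queue is the design choice that matters: an unbounded buffer would let the generator run arbitrarily far ahead of the current policy, reintroducing the stale-data problem the paper targets.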