OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

📅 2025-09-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address strong sequential dependencies among models and throughput degradation caused by long-generation responses in PPO-based RLHF training, this paper proposes a pipelined overlap framework. The framework enables fine-grained task scheduling to achieve both intra-step and inter-step computation overlap, supports streaming delivery of actor outputs to the reward model, and dynamically defers processing of tail-end long responses. Integrated with streaming chunked transmission, adaptive oversubscription, and prefill overlap, it operates as a plug-and-play enhancement without modifying the core PPO logic. Experiments demonstrate that the method maintains identical convergence behavior and alignment performance while accelerating training by 1.8–2.8× and improving GPU utilization by 1.4–2.1×, significantly enhancing RLHF training efficiency.

๐Ÿ“ Abstract
Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., the reward model depends on actor outputs) and long-tail response lengths, where a few long responses straggle behind and delay stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations with a few lines of code change. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by 1.8×–2.8× and improves GPU utilization by 1.4×–2.1× without compromising training convergence.
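The intra-step overlap described in the abstract is essentially a producer–consumer pipeline: the actor streams right-sized chunks of its output so the reward model can start prefill before decoding finishes. Below is a minimal, illustrative sketch of that pattern; the chunk size, function names, and integer "tokens" are stand-ins and not the paper's actual implementation:

```python
import queue
import threading

CHUNK_SIZE = 4  # hypothetical "right-sized" chunk; OPPO tunes this per workload

def actor_decode(out_q):
    """Stand-in for the actor's token-by-token decoding loop.
    Streams chunks downstream as soon as they fill, instead of
    waiting for the full response."""
    chunk = []
    for tok in range(10):  # pretend these are generated token ids
        chunk.append(tok)
        if len(chunk) == CHUNK_SIZE:
            out_q.put(list(chunk))  # deliver a chunk immediately
            chunk.clear()
    if chunk:
        out_q.put(list(chunk))  # flush the final partial chunk
    out_q.put(None)  # end-of-stream sentinel

def reward_prefill(in_q, received):
    """Stand-in for the reward model consuming chunks as they arrive,
    so its prefill overlaps the actor's ongoing decode."""
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        received.extend(chunk)  # in a real system this would extend the KV cache

q = queue.Queue()
received = []
consumer = threading.Thread(target=reward_prefill, args=(q, received))
consumer.start()
actor_decode(q)
consumer.join()
# received now holds all tokens, assembled while both sides ran concurrently
```

In the real framework both sides are GPU model executions rather than threads, but the scheduling shape is the same: the downstream stage never idles waiting for the upstream stage to finish an entire response.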
Problem

Research questions and friction points this paper is trying to address.

PPO-based RLHF training is slowed by its strictly sequential pipeline
Sequential multi-model dependencies (e.g., reward model waiting on actor outputs) leave GPUs idle
Long-tail response lengths straggle stage completion and degrade throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Overlaps pipeline execution, both intra-step and inter-step, to accelerate training
Streams upstream outputs in right-sized chunks so the downstream model begins prefill early
Adaptively overcommits prompts and defers long generations to later steps to mitigate tail latency
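The inter-step overlap amounts to a scheduling rule: launch a few more prompts than the batch needs, fill the step with whichever responses finish first, and carry the long-tail stragglers into the next step rather than waiting on them. A toy sketch of that rule, with all names and the `(prompt_id, remaining_decode_steps)` representation being illustrative assumptions:

```python
def schedule_step(in_flight, batch_size):
    """Toy inter-step scheduler (illustrative, not the paper's code).
    in_flight: list of (prompt_id, remaining_decode_steps) pairs,
    intentionally oversubscribed beyond batch_size.
    Returns the ids that fill this step's batch and the deferred tail."""
    in_flight = sorted(in_flight, key=lambda p: p[1])  # fastest finishers first
    finished = in_flight[:batch_size]                  # fill this step's batch
    deferred = in_flight[batch_size:]                  # long tails roll over,
    return [pid for pid, _ in finished], deferred      # partial work is kept

# Example: 5 in-flight prompts, batch of 3; the two long generations defer.
step_batch, carry = schedule_step(
    [(0, 5), (1, 50), (2, 3), (3, 40), (4, 7)], batch_size=3
)
# step_batch == [2, 0, 4]; carry == [(3, 40), (1, 50)]
```

The key property, per the abstract, is that deferred generations are not discarded: their partial decoding carries into future steps, so tail latency is hidden without wasting work.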