RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training

📅 2025-09-25
🤖 AI Summary
In synchronous reinforcement learning (RL) post-training, heterogeneous rollout response lengths induce GPU computation bubbles—idle cycles during synchronization—severely limiting training efficiency. To address this, we propose Tail-Batching, a scheduling mechanism that dynamically clusters long-tail responses into a few training steps for concentrated execution, thereby preventing them from throttling global synchronization. Integrated with elastic parallel inference, streaming training, and dynamic resource allocation, Tail-Batching enables end-to-end system-level co-optimization. Crucially, it achieves a substantial reduction in GPU idle time without compromising policy accuracy. Experiments on a 128-GPU H800 cluster using Qwen2.5-family models demonstrate that our approach accelerates end-to-end RL post-training by 2.03–2.56× over veRL and by up to 2.24× over RLHFuse. This work delivers a highly efficient and scalable systems solution for large-scale RL post-training.

📝 Abstract
Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
Problem

Research questions and friction points this paper is trying to address.

GPU underutilization ("bubbles") caused by imbalanced response lengths within synchronous RL rollout steps
Compromised training accuracy when systems relax synchronization to reduce bubbles
Long-tail rollouts leaving GPUs idle for significant stretches of RL post-training
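The severity of these bubbles can be illustrated with a toy calculation (the numbers and function are illustrative, not from the paper): in a fully synchronous rollout step, every GPU waits for the longest response, so idle time is governed by the maximum of the response-length distribution rather than its mean.

```python
# Hypothetical illustration of synchronization bubbles (not the paper's model):
# in a synchronous rollout step, step time = max(response lengths), so every
# shorter response leaves its GPU idle until the straggler finishes.
def bubble_fraction(lengths):
    """Fraction of total GPU time spent idle in one synchronous step."""
    step = max(lengths)            # step ends only when the longest rollout does
    busy = sum(lengths)            # useful decode time across all GPUs
    total = step * len(lengths)    # wall-clock time reserved on all GPUs
    return (total - busy) / total

# Seven short 100-token responses plus one 1000-token long-tail straggler:
print(bubble_fraction([100] * 7 + [1000]))  # → 0.7875, most of the step is idle
```

A single long-tail response in a batch of eight leaves the cluster idle for nearly 79% of the step, which is the waste tail batching targets.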
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tail batching consolidates prompts with long-tail responses into a few designated long rounds
Elastic parallelism adaptation optimizes the rollout stage
Dynamic resource allocation and scheduling improve the reward stage
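The tail-batching idea above can be sketched as a simple scheduler. This is a minimal illustration, assuming a per-prompt length estimate is available; the function name, quantile cutoff, and batching policy are hypothetical, not the authors' implementation:

```python
# Hypothetical sketch of tail batching: pack balanced short rollouts into
# "short rounds" and defer long-tail prompts to a few dedicated "long rounds",
# so stragglers only stall each other rather than every synchronization step.
def tail_batch(prompts, est_len, batch_size, tail_quantile=0.9):
    """Split prompts into ('short', batch) and ('long', batch) rounds."""
    lengths = sorted(est_len(p) for p in prompts)
    cutoff = lengths[int(tail_quantile * (len(lengths) - 1))]
    short = [p for p in prompts if est_len(p) <= cutoff]
    long_tail = [p for p in prompts if est_len(p) > cutoff]

    rounds = []
    # Short rounds contain only balanced, short rollouts -> minimal bubbles.
    for i in range(0, len(short), batch_size):
        rounds.append(("short", short[i:i + batch_size]))
    # Long rounds concentrate the long-tail prompts together.
    for i in range(0, len(long_tail), batch_size):
        rounds.append(("long", long_tail[i:i + batch_size]))
    return rounds
```

In the actual system, length estimates would come from online prediction or prior rollouts, and the long rounds are where RollPacker's elastic parallelism and streaming training matter most.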
Authors

Wei Gao (Hong Kong University of Science and Technology)
Yuheng Zhao (Fudan University)
Dakai An (Hong Kong University of Science and Technology)
Tianyuan Wu (CSE Department, HKUST)
Lunxi Cao (Hong Kong University of Science and Technology)
Shaopan Xiong (Alibaba Group)
Ju Huang (Alibaba Group)
Weixun Wang (Taobao & Tmall Group of Alibaba)
Siran Yang (Alibaba Group)
Wenbo Su (Taobao & Tmall Group of Alibaba)
Jiamang Wang (Alibaba Group)
Lin Qu (Alibaba Group)
Bo Zheng (Alibaba Group)
Wei Wang (Hong Kong University of Science and Technology)