RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training

📅 2025-09-25
🤖 AI Summary
In synchronous reinforcement learning (RL) post-training, heterogeneous rollout response lengths induce GPU computation bubbles—idle cycles during synchronization—severely limiting training efficiency. To address this, we propose Tail-Batching, a scheduling mechanism that dynamically clusters long-tail responses into a few training steps for concentrated execution, thereby preventing them from throttling global synchronization. Integrated with elastic parallel inference, streaming training, and dynamic resource allocation, Tail-Batching enables end-to-end system-level co-optimization. Crucially, it achieves a substantial reduction in GPU idle time without compromising policy accuracy. Experiments on a 128-GPU H800 cluster using Qwen2.5-family models demonstrate that our approach accelerates end-to-end RL post-training by 2.03–2.56× over veRL and by up to 2.24× over RLHFuse. This work delivers a highly efficient and scalable systems solution for large-scale RL post-training.

📝 Abstract
Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
Problem

Research questions and friction points this paper is trying to address.

GPU underutilization ("bubbles") caused by imbalanced response lengths within synchronous RL rollout steps
Compromised training accuracy when systems relax synchronization to reduce bubbles
Long-tail rollouts leaving GPUs idle for significant stretches of RL post-training
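The severity of these bubbles can be illustrated with a toy calculation (the numbers and function are illustrative, not from the paper): in a fully synchronous rollout step, every GPU waits for the longest response, so idle time is governed by the maximum of the response-length distribution rather than its mean.

```python
# Hypothetical illustration of synchronization bubbles (not the paper's model):
# in a synchronous rollout step, step time = max(response lengths), so every
# shorter response leaves its GPU idle until the straggler finishes.
def bubble_fraction(lengths):
    """Fraction of total GPU time spent idle in one synchronous step."""
    step = max(lengths)            # step ends only when the longest rollout does
    busy = sum(lengths)            # useful decode time across all GPUs
    total = step * len(lengths)    # wall-clock time reserved on all GPUs
    return (total - busy) / total

# Seven short 100-token responses plus one 1000-token long-tail straggler:
print(bubble_fraction([100] * 7 + [1000]))  # → 0.7875, most of the step is idle
```

A single long-tail response in a batch of eight leaves the cluster idle for nearly 79% of the step, which is the waste tail batching targets.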
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tail batching consolidates prompts with long-tail responses into a few designated long rounds
Elastic parallelism adaptation optimizes the rollout stage
Dynamic resource allocation and scheduling improve the reward stage
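The tail-batching idea above can be sketched as a simple scheduler. This is a minimal illustration, assuming a per-prompt length estimate is available; the function name, quantile cutoff, and batching policy are hypothetical, not the authors' implementation:

```python
# Hypothetical sketch of tail batching: pack balanced short rollouts into
# "short rounds" and defer long-tail prompts to a few dedicated "long rounds",
# so stragglers only stall each other rather than every synchronization step.
def tail_batch(prompts, est_len, batch_size, tail_quantile=0.9):
    """Split prompts into ('short', batch) and ('long', batch) rounds."""
    lengths = sorted(est_len(p) for p in prompts)
    cutoff = lengths[int(tail_quantile * (len(lengths) - 1))]
    short = [p for p in prompts if est_len(p) <= cutoff]
    long_tail = [p for p in prompts if est_len(p) > cutoff]

    rounds = []
    # Short rounds contain only balanced, short rollouts -> minimal bubbles.
    for i in range(0, len(short), batch_size):
        rounds.append(("short", short[i:i + batch_size]))
    # Long rounds concentrate the long-tail prompts together.
    for i in range(0, len(long_tail), batch_size):
        rounds.append(("long", long_tail[i:i + batch_size]))
    return rounds
```

In the actual system, length estimates would come from online prediction or prior rollouts, and the long rounds are where RollPacker's elastic parallelism and streaming training matter most.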
Authors

Wei Gao (Hong Kong University of Science and Technology)
Yuheng Zhao (Fudan University)
Dakai An (Hong Kong University of Science and Technology)
Tianyuan Wu (CSE Department, HKUST)
Lunxi Cao (Hong Kong University of Science and Technology)
Shaopan Xiong (Alibaba Group)
Ju Huang (Alibaba Group)
Weixun Wang (Taobao & Tmall Group of Alibaba)
Siran Yang (Alibaba Group)
Wenbo Su (Taobao & Tmall Group of Alibaba)
Jiamang Wang (Alibaba Group)
Lin Qu (Alibaba Group)
Bo Zheng (Alibaba Group)
Wei Wang (Hong Kong University of Science and Technology)