RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing RLHF training suffers from low GPU utilization due to data skew in the generation stage and pipeline bubbles in the training stage. This paper proposes two fusion techniques: (1) inter-stage fusion, which splits generation and inference tasks into sample-level subtasks and overlaps the two stages, mitigating the long-tail latency caused by skewed sample lengths in generation; and (2) intra-stage fusion, which decomposes training tasks into micro-batch-level subtasks and executes them concurrently under a fused pipeline schedule, reducing pipeline bubbles within the training stage. Combined with stage-specific system optimizations, this subtask-level execution abstraction achieves up to 3.7× higher training throughput than existing state-of-the-art systems on several popular LLMs.

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preferences. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit, thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of the RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that one pipeline's execution can essentially be complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7×, compared to existing state-of-the-art systems.
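The intuition that "pipeline execution can be complemented by another pipeline" can be made concrete with a toy bubble-counting model. This is a simplified sketch (forward-only, GPipe-style schedule with unit-time micro-batches), not the actual fused 1F1B schedule used by RLHFuse: running two training tasks back-to-back pays two ramp-up/ramp-down bubbles, while a fused schedule lets the second task's micro-batches fill the first task's ramp-down slots.

```python
# Toy model of the pipeline-bubble argument behind intra-stage fusion.
# A pipeline with S stages and M unit-time micro-batches keeps each stage
# busy for M slots out of S + M - 1 total, so the bubble fraction is
# (S - 1) / (S + M - 1). Fusing two tasks amortizes that fixed ramp cost.

def makespan(num_stages: int, num_microbatches: int) -> int:
    """Time slots for one pipelined task (unit-time micro-batches)."""
    return num_stages + num_microbatches - 1

def bubble_fraction(total_slots: int, busy_slots: int) -> float:
    """Fraction of per-stage slots that are idle (bubbles)."""
    return (total_slots - busy_slots) / total_slots

S, M = 8, 16  # illustrative sizes, not from the paper

# Two training tasks run one after the other: each pays its own bubbles.
sequential_total = 2 * makespan(S, M)
sequential_bubbles = bubble_fraction(sequential_total, 2 * M)

# Fused schedule: the second task's micro-batches fill the first task's
# ramp-down bubbles, leaving a single ramp-up and ramp-down overall.
fused_total = makespan(S, 2 * M)
fused_bubbles = bubble_fraction(fused_total, 2 * M)

print(f"sequential: {sequential_total} slots, bubbles {sequential_bubbles:.1%}")
print(f"fused:      {fused_total} slots, bubbles {fused_bubbles:.1%}")
```

With these numbers, the fused schedule finishes in 39 slots versus 46 sequentially, cutting the bubble fraction from about 30% to about 18%; the real system interleaves forward and backward micro-batches, but the amortization effect is the same.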
Problem

Research questions and friction points this paper is trying to address.

Low GPU utilization in RLHF training
Data skewness in the generation stage
Pipeline bubbles in the training stage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Splits generation and inference tasks into sample-level subtasks
Performs inter-stage fusion of generation and inference
Applies intra-stage fusion with a fused pipeline schedule for training
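The benefit of sample-level inter-stage fusion can be sketched with a small timing model. This is an illustrative assumption-laden toy (made-up generation times, a single serial inference engine, per-sample inference cost `t_inf`), not RLHFuse's scheduler: instead of waiting for the slowest generation sample, each finished sample flows to inference immediately, so inference overlaps the long tail of generation.

```python
# Skewed generation lengths: most samples finish quickly, one long-tail
# sample dominates (illustrative numbers, not from the paper).
gen_times = [1, 1, 2, 2, 3, 3, 4, 20]
t_inf = 2  # assumed per-sample cost of the inference tasks (e.g. reward model)

# Stage-by-stage execution: inference cannot start until the slowest
# generation sample has finished.
staged_total = max(gen_times) + len(gen_times) * t_inf

# Inter-stage fusion: each sample moves to inference as soon as its own
# generation completes, hiding inference time behind the long tail.
t = 0
for g in sorted(gen_times):
    start = max(g, t)   # sample is ready at time g; engine is free at time t
    t = start + t_inf
fused_total = t

print(f"stage-by-stage: {staged_total} time units")
print(f"fused:          {fused_total} time units")
```

Here the fused schedule finishes in 22 time units against 36 for strict stage-by-stage execution; the longer the tail relative to the typical sample, the larger the gap.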