StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the scalability and cost-efficiency bottlenecks in reinforcement learning (RL) post-training of large language models (LLMs), caused by resource coupling across RL stages. The authors propose StreamRL, a fully decoupled streaming RL architecture. It introduces: (1) a streaming sample generation mechanism that removes the rigid synchronization boundary between the generation and training stages; (2) an output-length ranker model that predicts the long-tail distribution of sequence lengths, enabling skewness-aware dispatching and scheduling; and (3) explicit separation of the generation and training stages, which supports heterogeneous resource allocation and cross-datacenter deployment. Experiments demonstrate up to a 2.66x throughput improvement over state-of-the-art systems and up to a 1.33x cost-efficiency gain in a heterogeneous, cross-datacenter setting. The architecture mitigates both pipeline bubbles, arising from stage-level synchronization, and skewness bubbles, caused by imbalanced output-length distributions, thereby improving overall system utilization and training efficiency.
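The streaming decoupling described above can be sketched as a producer/consumer pipeline: the generation stage emits each finished sample immediately, and the training stage consumes samples as they arrive, so the two stages overlap instead of synchronizing at batch boundaries. This is a minimal illustrative sketch, not StreamRL's actual implementation; the function names and the stand-in strings for decoding and gradient steps are assumptions.

```python
import queue
import threading

def generate_samples(prompts, sample_queue):
    """Generation stage: emit each finished sample immediately instead of
    waiting for the whole batch (per-sample streaming, not batch-synchronous)."""
    for prompt in prompts:
        sample = f"<response to {prompt}>"  # stand-in for LLM decoding
        sample_queue.put(sample)            # trainer can start as soon as one arrives
    sample_queue.put(None)                  # sentinel: generation finished

def train_on_stream(sample_queue, results):
    """Training stage: consume samples as they arrive, overlapping with generation."""
    while True:
        sample = sample_queue.get()
        if sample is None:
            break
        results.append(f"trained on {sample}")  # stand-in for a gradient step

sample_queue = queue.Queue(maxsize=8)  # bounded buffer between the two stages
results = []
producer = threading.Thread(
    target=generate_samples,
    args=([f"p{i}" for i in range(4)], sample_queue),
)
consumer = threading.Thread(target=train_on_stream, args=(sample_queue, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

With dedicated (disaggregated) resources per stage, the trainer no longer idles while the whole generation batch finishes, which is the pipeline bubble the paper targets.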

📝 Abstract
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.
Problem

Research questions and friction points this paper is trying to address.

Addresses resource coupling in colocated RL for LLMs
Solves pipeline and skewness bubbles in disaggregated RL
Improves scalability and cost-efficiency in RL training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated architecture for flexible resource allocation
Stream generation to eliminate pipeline bubbles
Skewness-aware dispatching for efficient scheduling
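One way to picture the skewness-aware dispatching is a longest-predicted-first greedy placement: sort samples by the ranker's predicted output length and assign each to the currently least-loaded generation worker, so a long-tail sample does not get batched behind short ones. This is a minimal sketch under assumptions; `predicted_length` stands in for the paper's learned output-length ranker, and the LPT-style heuristic is illustrative, not the paper's exact scheduling algorithm.

```python
import heapq

def predicted_length(sample_id, predictions):
    # Stand-in for the learned output-length ranker model.
    return predictions[sample_id]

def skewness_aware_dispatch(sample_ids, predictions, num_workers):
    """Place each sample, longest predicted output first, on the currently
    least-loaded worker so long-tail samples don't pile up together."""
    workers = [(0, w) for w in range(num_workers)]  # (current_load, worker_index)
    heapq.heapify(workers)
    assignment = {w: [] for w in range(num_workers)}
    for sid in sorted(sample_ids,
                      key=lambda s: predicted_length(s, predictions),
                      reverse=True):
        load, w = heapq.heappop(workers)
        assignment[w].append(sid)
        heapq.heappush(workers, (load + predicted_length(sid, predictions), w))
    return assignment

# One long-tail sample ("a") among several short ones (hypothetical lengths).
predictions = {"a": 4096, "b": 512, "c": 480, "d": 450, "e": 400}
plan = skewness_aware_dispatch(list(predictions), predictions, num_workers=2)
```

Here the long-tail sample gets a worker to itself while the short samples share the other, shrinking the generation-time tail that causes skewness bubbles.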