🤖 AI Summary
Reinforcement learning (RL) for large language model post-training faces scalability bottlenecks due to load imbalance in distributed training.
Method: This paper proposes a fully distributed, multi-controller architecture that eliminates centralized coordination, decoupling resource scheduling from execution logic. It supports heterogeneous task streams and dynamic execution control via decentralized task scheduling, fine-grained data parallelism, and adaptive flow management—enabling end-to-end distributed RL training.
Contribution/Results: Experiments demonstrate near-linear scalability up to 1,000 GPUs; end-to-end throughput improves by up to 7× over state-of-the-art frameworks. The architecture significantly enhances efficiency, flexibility, and scalability of large-scale RL training while maintaining robustness under dynamic workloads and hardware heterogeneity.
📝 Abstract
Reinforcement learning (RL) has become the pivotal post-training technique for large language models. Effectively scaling RL is now key to unlocking advanced reasoning capabilities and ensuring safe, goal-aligned behavior in the most powerful LLMs. Mainstream frameworks usually employ a hybrid-controller architecture in which a single controller dispatches the overall execution logic and manages data transfer, while a multi-controller executes distributed computation. At large scale, even minor load imbalances can introduce significant bottlenecks, ultimately constraining the scalability of the system. To address this limitation, we introduce DistFlow, a novel, fully distributed RL framework designed to break this scaling barrier. We adopt a multi-controller paradigm that dispatches data transfer and execution tasks to all workers, eliminating the centralized node. This allows each worker to operate independently, leading to near-linear scalability up to thousands of GPUs and dramatic efficiency gains. Furthermore, our architecture decouples resource configuration from execution logic, allowing each worker to follow a unique execution flow and offering significant flexibility for rapid, cost-effective algorithmic experimentation. Extensive experiments show that DistFlow achieves excellent linear scalability and up to a 7x end-to-end throughput improvement over state-of-the-art (SOTA) frameworks.
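The decentralized dispatch idea described above can be illustrated with a minimal sketch: each worker owns its task stream and executes independently, so no central node sits on the critical path and a slow worker only delays itself. All names here (`run_workers_decentralized`, the squaring stand-in for rollout compute) are illustrative assumptions, not DistFlow's actual API.

```python
import queue
import threading

def run_workers_decentralized(num_workers: int, tasks_per_worker: int):
    """Hypothetical multi-controller-style execution: every worker pulls
    from its own local task queue, with no centralized dispatcher."""
    results = [[] for _ in range(num_workers)]  # one result list per worker

    def worker(rank: int):
        # Worker-local task queue: tasks are generated and consumed locally,
        # mirroring the decentralized scheduling described in the abstract.
        local_tasks = queue.Queue()
        for t in range(tasks_per_worker):
            local_tasks.put(t)
        while not local_tasks.empty():
            t = local_tasks.get()
            results[rank].append(t * t)  # stand-in for rollout/training compute

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Each worker finishes its own stream regardless of the others' progress.
out = run_workers_decentralized(num_workers=4, tasks_per_worker=3)
```

In a single-controller design, by contrast, one process would enqueue and collect every task, so any straggler or load imbalance stalls the global loop — the bottleneck DistFlow's architecture is designed to remove.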