MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) training systems suffer from poor cluster scalability and low memory utilization due to strong cross-node dependencies. To address this, we propose a distributed dataflow architecture tailored for large-scale RL training, featuring a novel distributed transfer dock and an allgather-swap mechanism that decouple sample streaming from resharding traffic, eliminating centralized scheduling bottlenecks and substantially reducing communication overhead and redundant memory consumption. Integrated with a dynamic controller, warehouse-style deployment, optimized resharding communication, and multi-dimensional parallelism acceleration, our design enables holistic system-level co-optimization. Evaluated on a 384-chip Ascend NPU supercomputing cluster, our system achieves 1.42 to 3.97 times higher throughput than state-of-the-art baselines and efficiently supports alignment training of models ranging from billions to hundreds of billions of parameters, including Qwen and DeepSeek.

📝 Abstract
Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms utilize multiple workers and can be modeled as a graph, where each node is the status of a worker and each edge represents dataflow between nodes. Owing to the heavy cross-node dependencies, RL training systems usually suffer from poor cluster scalability and low memory utilization. In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training. Unlike existing centralized methods, MindSpeed RL organizes the essential data dependencies in RL training, i.e., the sample flow and the resharding flow, from a distributed view. On the one hand, a distributed transfer dock strategy, which builds controllers and warehouses on top of the conventional replay buffer, is designed to reduce the dispatch overhead in the sample flow. On the other hand, a practical allgather-swap strategy is presented to eliminate redundant memory usage in the resharding flow. In addition, MindSpeed RL integrates numerous parallelization strategies and acceleration techniques for systematic optimization. Compared with existing state-of-the-art systems, comprehensive experiments on the RL training of the popular Qwen2.5-Dense-7B/32B, Qwen3-MoE-30B, and DeepSeek-R1-MoE-671B models show that MindSpeed RL increases throughput by 1.42 to 3.97 times. Finally, we open-source MindSpeed RL and perform all the experiments on an Ascend super pod with 384 neural processing units (NPUs) to demonstrate the powerful performance and reliability of Ascend.
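The memory argument behind the allgather-swap strategy can be illustrated with a toy model. This is a hedged sketch, not the paper's implementation: the function names and the list-based "shards" are invented for illustration. The point is that gathering shard by shard and swapping out each training-side copy as soon as it has been gathered keeps peak residency at one full copy of the weights, whereas a naive allgather holds the full tensor alongside every still-resident training shard.

```python
# Toy memory model of resharding (illustrative only; not MindSpeed RL code).
# Each "shard" is a list of elements; memory is counted in elements.

def reshard_naive(train_shards):
    # Naive resharding: materialize the full tensor while every training
    # shard is still resident -> peak = full copy + all shards (2x).
    full = [x for shard in train_shards for x in shard]  # "allgather"
    peak = len(full) + sum(len(s) for s in train_shards)
    return full, peak

def reshard_allgather_swap(train_shards):
    # Allgather-swap: gather shard by shard and release ("swap out") each
    # training-side copy immediately, so redundant copies never accumulate.
    full, resident = [], sum(len(s) for s in train_shards)
    peak = resident
    for i, shard in enumerate(train_shards):
        full.extend(shard)           # gather this shard into the full copy
        resident -= len(shard)       # swap out the training-side copy
        train_shards[i] = None
        peak = max(peak, len(full) + resident)
    return full, peak

shards = [[1, 2], [3, 4], [5, 6], [7, 8]]
full_a, peak_a = reshard_naive([list(s) for s in shards])
full_b, peak_b = reshard_allgather_swap([list(s) for s in shards])
# peak_a counts two full copies; peak_b stays at one full copy.
```

Under this toy accounting, both paths produce the same gathered weights, but the swap ordering halves the peak footprint, which is the redundancy the abstract says the strategy eliminates.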
Problem

Research questions and friction points this paper is trying to address.

Poor cluster scalability in large-scale RL training
Low memory utilization in existing RL systems
Heavy cross-node dependencies in distributed RL dataflow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed transfer dock strategy for sample flow
Allgather-swap strategy for resharding flow
Integrated parallelization and acceleration techniques
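The transfer dock idea above can be sketched as a minimal mental model. This is a hedged illustration under assumed semantics, not MindSpeed RL's actual API: `Warehouse`, `Controller`, and all method names are invented. The key property is that the controller tracks only metadata (per-warehouse counts), while sample payloads live in per-group warehouses and never pass through a central dispatcher, which is how a distributed dock avoids the bottleneck of a centralized replay buffer.

```python
# Illustrative transfer-dock sketch (not the paper's implementation).
from collections import deque

class Warehouse:
    """Holds the actual samples for one worker group, replacing one
    slice of a centralized replay buffer."""
    def __init__(self):
        self.samples = deque()
    def put(self, sample):
        self.samples.append(sample)
    def get(self):
        return self.samples.popleft()

class Controller:
    """Tracks only (warehouse_id -> count) metadata, so dispatch
    decisions are decoupled from the sample payloads themselves."""
    def __init__(self, warehouses):
        self.warehouses = warehouses
        self.counts = {wid: 0 for wid in warehouses}
    def register(self, wid, sample):
        self.warehouses[wid].put(sample)
        self.counts[wid] += 1
    def route(self):
        # Pull from the fullest warehouse; the payload is fetched
        # directly from that warehouse, not relayed by the controller.
        wid = max(self.counts, key=self.counts.get)
        self.counts[wid] -= 1
        return self.warehouses[wid].get()

docks = {i: Warehouse() for i in range(2)}
ctrl = Controller(docks)
ctrl.register(0, "rollout-a")
ctrl.register(1, "rollout-b")
ctrl.register(1, "rollout-c")
first = ctrl.route()  # served from warehouse 1, the fullest
```

In a real deployment the warehouses would be distributed across nodes and the transfers asynchronous; the sketch only shows why moving metadata instead of samples removes the central dispatch hot spot.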
Authors

Liangjun Feng (Huawei Lianqiu Lake R&D Center, Xicen Community, Jinze Town, Qingpu District, Shanghai 201718, China)
Chenyi Pan (Huawei Lianqiu Lake R&D Center)
Xinjie Guo (Huawei Lianqiu Lake R&D Center)
Fei Mei (Huawei Lianqiu Lake R&D Center)
Benzhe Ning (Huawei Lianqiu Lake R&D Center)
Jianxiang Zhang (Huawei Lianqiu Lake R&D Center)
Xinyang Liu (Huawei Lianqiu Lake R&D Center)
Beirong Zhou (Huawei Lianqiu Lake R&D Center)
Zeng Shu (Huawei Lianqiu Lake R&D Center)
Chang Liu (Huawei Lianqiu Lake R&D Center)
Guang Yang (Huawei Lianqiu Lake R&D Center)
Zhenyu Han (Ph.D., Department of Electronic Engineering, Tsinghua University, China)
Jiangben Wang (Huawei Lianqiu Lake R&D Center)
Bo Wang (Huawei Lianqiu Lake R&D Center)