🤖 AI Summary
Existing AllReduce scheduling methods generalize poorly because they depend on topology-specific designs or handcrafted heuristics. To address this, the paper proposes a hierarchical deep reinforcement learning (DRL) framework that, for the first time, enables end-to-end automatic scheduling across heterogeneous network topologies, including BCube, DCell, and Jellyfish. The framework employs a dual-level collaborative Proximal Policy Optimization (PPO) architecture that jointly models traffic scheduling and abstracts the topology structure, eliminating reliance on domain expertise and topology-specific features. It is accompanied by an open-source, high-fidelity AllReduce simulation environment. Experimental results show that the method achieves, on average, a 19.7% reduction in scheduling latency and a 22.3% increase in throughput across all three topologies, significantly outperforming conventional heuristic and static scheduling baselines.
📝 Abstract
AllReduce is a collective-communication primitive in distributed computing that underpins many critical deep learning applications. Existing AllReduce scheduling methods often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain knowledge. In this work, we alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that generates AllReduce schedules for diverse network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically structured DRL policies that cooperate to find high-quality schedules. We evaluate our method against baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based simulation environment for AllReduce scheduling on these network topologies.
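For readers unfamiliar with the collective itself, a minimal ring-AllReduce sketch (our own illustration, not the paper's scheduler or simulator) shows the reduce-scatter and all-gather phases whose constituent flows a scheduler must place onto the network:

```python
# Illustrative ring AllReduce over n workers, each holding n scalar chunks.
# This is a hypothetical sketch of the collective, not the paper's code.

def ring_allreduce(values):
    """values: list of n workers' chunk lists (each of length n).
    Returns per-worker state after the collective: every worker
    holds the element-wise sum of all workers' chunks."""
    n = len(values)
    data = [list(v) for v in values]  # copy each worker's local state

    # Phase 1 -- reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its ring neighbor, which accumulates it. After n-1 steps, worker i
    # owns the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:          # snapshot first: sends are simultaneous
            data[(i + 1) % n][c] += val

    # Phase 2 -- all-gather: circulate the completed chunks around the ring
    # so every worker ends up with all n reduced chunks.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val

    return data
```

Each phase takes n-1 communication steps, and every step moves one chunk per link; on irregular topologies such as Jellyfish these flows contend for links in non-obvious ways, which is the scheduling problem the pipeline targets.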