🤖 AI Summary
Existing AllReduce scheduling methods generalize poorly because they depend on topology-specific designs or handcrafted heuristics. To address this, the paper proposes a hierarchical deep reinforcement learning (DRL) framework that, for the first time, enables end-to-end automatic scheduling across heterogeneous network topologies, including BCube, DCell, and Jellyfish. The framework employs a dual-level collaborative Proximal Policy Optimization (PPO) architecture that jointly models traffic scheduling and abstracts the topology structure, eliminating reliance on domain expertise and topology-specific features. It is accompanied by an open-source, high-fidelity AllReduce simulation environment. Experimental results show that the method achieves, on average, a 19.7% reduction in scheduling latency and a 22.3% increase in throughput across all three topologies, significantly outperforming conventional heuristic and static scheduling baselines.
📝 Abstract
AllReduce is a collective-communication primitive in distributed computing that underpins many critical deep learning applications. Existing AllReduce scheduling methods often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain knowledge. In this work, we alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that generates AllReduce schedules for diverse network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically structured DRL policies that cooperate to find high-quality schedules. We evaluate our method against baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based simulation environment for AllReduce scheduling on these network topologies.
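For readers unfamiliar with the collective itself, a minimal ring-AllReduce sketch (our own illustration, not the paper's scheduler or simulator) shows the reduce-scatter and all-gather phases whose constituent flows a scheduler must place onto the network:

```python
# Illustrative ring AllReduce over n workers, each holding n scalar chunks.
# This is a hypothetical sketch of the collective, not the paper's code.

def ring_allreduce(values):
    """values: list of n workers' chunk lists (each of length n).
    Returns per-worker state after the collective: every worker
    holds the element-wise sum of all workers' chunks."""
    n = len(values)
    data = [list(v) for v in values]  # copy each worker's local state

    # Phase 1 -- reduce-scatter: in step s, worker i sends chunk (i - s) mod n
    # to its ring neighbor, which accumulates it. After n-1 steps, worker i
    # owns the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:          # snapshot first: sends are simultaneous
            data[(i + 1) % n][c] += val

    # Phase 2 -- all-gather: circulate the completed chunks around the ring
    # so every worker ends up with all n reduced chunks.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val

    return data
```

Each phase takes n-1 communication steps, and every step moves one chunk per link; on irregular topologies such as Jellyfish these flows contend for links in non-obvious ways, which is the scheduling problem the pipeline targets.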