Efficient Direct-Connect Topologies for Collective Communications

📅 2022-02-07
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Existing direct-connect network topologies exhibit inadequate adaptability to diverse scales, node degrees, and latency–bandwidth trade-offs in high-performance computing (HPC) collective communication. Method: This paper proposes an iterative expansion framework grounded in small-scale optimal base topologies. It introduces, for the first time, a graph-synthesis-driven automatic topology generation mechanism and designs the first polynomial-time collective communication scheduling algorithm for canonical large-scale topologies—including Dragonfly and Fat-Tree. Contribution/Results: The work unifies topology synthesis and scheduling optimization within a single modeling framework, enabling cross-platform deployment and large-scale simulation validation. Experimental evaluation demonstrates that the proposed approach reduces average communication latency by 23.6% and improves bandwidth utilization by 31.4% compared to conventional topologies, while significantly enhancing scalability and practical applicability in real-world HPC systems.
📝 Abstract
We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
Problem

Research questions and friction points this paper is trying to address.

Network Structure
Optimization
Information Flow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smart Algorithm
Optimal Network Structure
Communication Efficiency Enhancement
🔎 Similar Papers
No similar papers found.