AI Summary
This work addresses the significant performance degradation caused by network congestion in multi-tenant GPU clusters, where over one-third of training jobs suffer reduced throughput. To tackle this, the authors propose a defragmentation scheduling approach based on job migration, the first to use migration for congestion control in multi-tenant training environments in place of conventional network-layer solutions. By formulating the problem as an integer linear program, integrating RDMA-based in-memory checkpointing for fast recovery, and optimizing placement with awareness of ring-based collective communication patterns, the method achieves a provable near-optimal bound on fragmentation and supports hybrid parallelism. Experiments demonstrate a 14% reduction in average job completion time on a 1024-GPU cluster, and under a 2048-GPU configuration with a 16:1 oversubscription ratio, the p99 completion time remains within 5% of the ideal baseline.
Abstract
We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling, which we show have fundamental limits when traffic exceeds capacity, or require costly full-bisection-bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them easy to reach with few migrations. Once reached, low fragmentation is self-reinforcing, as new arrivals disturb only a few racks. MonkeyTree formulates defragmentation as an integer linear program that minimizes worker movements subject to per-rack fragmentation bounds. We prove a tight bound showing any placement can be defragmented to at most two cross-rack fragments per ToR, and extend the formulation to hybrid parallelism with multiple rings per server. Migration is implemented via in-memory checkpoint-and-restore over RDMA, incurring only 9.02 seconds of end-to-end system overhead per worker. We evaluate MonkeyTree using a custom simulator modeling clusters of up to 2,048 H200 GPUs and a prototype on a five-node A100 testbed. MonkeyTree improves average job completion time by 14 percent over the next-best baseline on a cluster of 1,024 GPUs with a 4:1 oversubscription ratio. With a high 16:1 oversubscription ratio and 2,048 GPUs, MonkeyTree keeps p99 job completion time within 5 percent of ideal.
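The traffic property the abstract leans on, that a ring collective over a rack-contiguous placement produces exactly one outgoing cross-rack flow per rack the job spans, can be illustrated with a small sketch. The function name and rack labels below are illustrative, not from the paper:

```python
def cross_rack_flows(ring):
    """Count cross-rack flows in a ring collective.

    `ring` lists the rack of each worker in ring order; each worker
    sends to its successor (the ring wraps around), so a flow crosses
    racks whenever consecutive workers sit in different racks.
    """
    n = len(ring)
    return sum(1 for i in range(n) if ring[i] != ring[(i + 1) % n])

# Rack-contiguous placement across 3 racks: exactly one outgoing
# cross-rack flow per rack the job spans.
contiguous = ["A", "A", "B", "B", "B", "C", "C"]
assert cross_rack_flows(contiguous) == 3

# A fragmented placement of the same workers generates many more
# cross-rack flows, which congest an oversubscribed core.
fragmented = ["A", "B", "A", "C", "B", "C", "B"]
assert cross_rack_flows(fragmented) == 7
```

This is why congestion-free placements are achievable at all: a job's cross-rack demand grows with the number of racks it spans, not with its worker count, so compact placements keep aggregate uplink traffic low.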
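The defragmentation objective, minimize migrated workers subject to per-rack fragmentation bounds, can be sketched with a toy brute-force search in place of the paper's ILP solver. All names, the tiny two-rack instance, and the bound of zero fragments are hypothetical, chosen only to make the optimization structure concrete:

```python
from itertools import product

def defragment(initial, job_of, capacity, bound):
    """Brute-force stand-in for the ILP: find a placement minimizing
    migrated workers, subject to a per-rack cap on cross-rack job
    fragments (illustrative only; the paper solves this as an ILP).

    initial:  rack of each worker, e.g. ["r0", "r1", ...]
    job_of:   job id of each worker (parallel to `initial`)
    capacity: dict mapping rack -> number of worker slots
    bound:    max cross-rack fragments allowed per rack
    """
    racks = sorted(capacity)
    best = None
    for placement in product(racks, repeat=len(initial)):
        # Respect rack capacities.
        if any(placement.count(r) > capacity[r] for r in racks):
            continue
        # A job "fragments" a rack if it has workers both in that
        # rack and in some other rack.
        feasible = True
        for r in racks:
            frags = 0
            for j in set(job_of):
                in_r = any(p == r and job_of[i] == j
                           for i, p in enumerate(placement))
                elsewhere = any(p != r and job_of[i] == j
                                for i, p in enumerate(placement))
                frags += in_r and elsewhere
            if frags > bound:
                feasible = False
                break
        if not feasible:
            continue
        moves = sum(a != b for a, b in zip(initial, placement))
        if best is None or moves < best[0]:
            best = (moves, placement)
    return best

# Two 3-worker jobs interleaved across two 3-slot racks: fully
# fragmented, but fixable by migrating just two workers.
moves, placement = defragment(
    initial=["r0", "r1", "r0", "r1", "r0", "r1"],
    job_of=["a", "a", "a", "b", "b", "b"],
    capacity={"r0": 3, "r1": 3},
    bound=0,
)
assert moves == 2
assert placement == ("r0", "r0", "r0", "r1", "r1", "r1")
```

The sketch also hints at why few migrations suffice in practice: any placement close to a valid configuration needs only a handful of moves, matching the abstract's claim that valid configurations are abundant and easy to reach.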