🤖 AI Summary
To address low resource utilization and poor scalability in GPU cluster scheduling for deep learning training, this paper proposes a novel job placement strategy based on graph matching. The method unifies multidimensional scheduling constraints (including topology awareness, GPU memory limits, migration overhead, and task packing density) into a weighted bipartite graph matching formulation. A lightweight distributed solver framework maintains high scheduling quality while significantly improving scalability. Experiments against state-of-the-art schedulers (e.g., Gandiva, Tiresias) show that the approach improves average job completion time by up to 1.62× and makespan by up to 1.15×; moreover, scheduling latency scales nearly linearly with cluster size. The key contribution is the first systematic modeling of fine-grained, co-dependent resource constraints as an efficiently solvable graph matching problem, achieving a balanced trade-off among performance, scalability, and practical deployability.
📝 Abstract
Training deep learning (DL) models has become a dominant workload in datacenters, and improving resource utilization is a key goal of DL cluster schedulers. To this end, schedulers typically incorporate placement policies that govern where jobs are placed on the cluster. Existing placement policies are either designed as ad hoc heuristics or incorporated as constraints within a complex optimization problem, and thus suffer from either suboptimal performance or poor scalability. Our key insight is that many placement constraints can be formulated as graph matching problems; based on this insight, we design novel placement policies for minimizing job migration overhead and for job packing. We integrate these policies into Tesserae and describe how our design leads to a scalable and effective GPU cluster scheduler. Our experimental results show that Tesserae improves average JCT by up to 1.62× and makespan by up to 1.15× compared with existing schedulers.
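To make the core idea concrete, the sketch below frames job placement as minimum-weight bipartite matching between jobs and nodes. The cost matrix, its values, and the brute-force solver are all illustrative assumptions, not the paper's actual formulation: a real scheduler would derive costs from the constraints above (topology, memory, migration, packing) and use an efficient matching algorithm (e.g., the Hungarian algorithm) rather than enumerating permutations.

```python
from itertools import permutations

# Hypothetical toy instance: costs[j][n] is the cost of placing job j on
# node n, imagined as a combined migration-overhead + packing penalty.
costs = [
    [4, 1, 3],  # job 0
    [2, 0, 5],  # job 1
    [3, 2, 2],  # job 2
]

def min_cost_matching(costs):
    """Brute-force minimum-weight bipartite matching (jobs -> nodes).

    Stands in for the efficient solvers a production scheduler would
    use; O(n!) enumeration is only viable for tiny instances.
    """
    n = len(costs)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(costs[j][perm[j]] for j in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

total, assignment = min_cost_matching(costs)
print(total, assignment)  # → 5 (1, 0, 2): job 0→node 1, job 1→node 0, job 2→node 2
```

The payoff of this framing is that once every placement concern is folded into edge weights, one well-studied solver handles them all jointly instead of a stack of ad hoc heuristics.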