Toward Co-adapting Machine Learning Job Shape and Cluster Topology

📅 2025-10-04
🤖 AI Summary
Existing schedulers for distributed machine learning jobs in multi-tenant torus-topology clusters struggle to optimize communication efficiency, which is constrained by job shape, and cluster utilization at the same time, and typically trade one for the other. This paper proposes RFold, a framework that adapts both job shapes and the optical circuit-switched cluster topology at runtime. By identifying homomorphic job shapes and reconfiguring the topology, the approach satisfies job placement and communication requirements while reducing network contention and raising resource utilization. A preliminary evaluation on a 4,096-node torus cluster simulator indicates that RFold improves absolute cluster utilization by 57% and reduces job completion time by up to 11× relative to existing methods.

📝 Abstract
Allocating resources to distributed machine learning jobs in multi-tenant torus-topology clusters must meet each job's specific placement and communication requirements, which are typically described using shapes. There is an inherent tension between minimizing network contention and maximizing cluster utilization when placing jobs of various shapes. While existing schedulers typically optimize for one objective at the expense of the other, we demonstrate that both can be achieved simultaneously. Our proposed approach, RFold, adapts both job shapes and the underlying cluster topology at runtime. This is accomplished by combining two techniques: (1) identifying homomorphic job shapes that support each job's communication needs, and (2) reconfiguring the optical circuit switch-enabled topology to support more diverse job shapes. Preliminary evaluation performed on a 4096-node torus cluster simulator indicates that RFold can improve absolute cluster utilization by 57% and reduce job completion time by up to 11× relative to existing methods.
Problem

Research questions and friction points this paper is trying to address.

Optimizing resource allocation for distributed ML jobs in torus clusters
Balancing network contention reduction with cluster utilization maximization
Adapting job shapes and cluster topology simultaneously at runtime
Innovation

Methods, ideas, or system contributions that make the work stand out.

RFold adapts job shapes and cluster topology dynamically
Identifies homomorphic job shapes that support each job's communication needs
Reconfigures the optical circuit switch-enabled topology to support more diverse job shapes
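
To make the idea of alternative job shapes concrete, here is a minimal Python sketch. It is not RFold's algorithm, which this summary does not describe. It only enumerates 3D shapes that have the same node count as a requested shape and fit within the torus dimensions. The function name `candidate_shapes`, the 3D assumption, and the example torus size are all hypothetical. Equal node count is a necessary condition only: the sketch does not check whether a candidate actually preserves the job's communication pattern.

```python
def candidate_shapes(shape, torus_dims):
    """Enumerate 3D shapes with the same node count as `shape`
    that fit inside a torus of size `torus_dims`.

    Hypothetical helper for illustration; it does NOT verify that a
    candidate shape supports the job's communication pattern.
    """
    volume = shape[0] * shape[1] * shape[2]
    found = set()
    for x in range(1, volume + 1):
        if volume % x:
            continue  # x must divide the node count
        rest = volume // x
        for y in range(1, rest + 1):
            if rest % y:
                continue  # y must divide what remains
            z = rest // y
            # Keep only shapes that fit in every torus dimension.
            if x <= torus_dims[0] and y <= torus_dims[1] and z <= torus_dims[2]:
                found.add((x, y, z))
    return sorted(found)


# An 8-node job requested as a 1x1x8 line cannot fit a 4x4x4 torus as-is,
# but several equal-size shapes can.
shapes = candidate_shapes((1, 1, 8), (4, 4, 4))
print(shapes)
```

In this toy case the requested line shape is too long for any dimension, so every returned candidate is a reshaped alternative. A real scheduler would then have to filter these candidates by communication compatibility and by the currently free nodes.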