Routing for Large ML Models

๐Ÿ“… 2025-03-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language model (LLM) training generates frequent, persistent, and highly regular cross-datacenter communication traffic, which is poorly served by conventional general-purpose traffic scheduling paradigms. Method: This paper proposes a network-wide efficiency quantification and dynamic routing optimization framework tailored to ML training characteristics. It introduces the first network-level training traffic efficiency metric, integrates traffic pattern modeling, integer linear programming (ILP)-based global optimization, and a lightweight online update mechanism to enable periodic, globally aware routing adjustments. Contribution/Results: Evaluated on real LLM training workloads, the framework reduces average communication latency by 18โ€“32% and improves GPU effective utilization by up to 24%, significantly accelerating training convergence. Its core contribution lies in deeply embedding ML training traffic semantics into network-layer designโ€”enabling communication efficiency to be quantifiable, optimizable, and practically deployable.

Technology Category

Application Category

๐Ÿ“ Abstract
Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training process exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for extit{quantifying} network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically extit{optimizing} routing with respect to this global metric.
Problem

Research questions and friction points this paper is trying to address.

Optimizing data routing for large ML models
Quantifying network efficiency in LLM training
Improving communication patterns in data centers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithmic framework for network efficiency quantification
Periodic optimization of routing for large ML models
Leverages regularity in communication patterns for optimization
๐Ÿ”Ž Similar Papers
No similar papers found.