Routing for Large ML Models

📅 2025-03-07

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

Large language model (LLM) training generates frequent, persistent, and highly regular cross-datacenter communication traffic, which is poorly served by conventional general-purpose traffic scheduling paradigms. Method: This paper proposes a network-wide efficiency quantification and dynamic routing optimization framework tailored to ML training characteristics. It introduces the first network-level training traffic efficiency metric, integrates traffic pattern modeling, integer linear programming (ILP)-based global optimization, and a lightweight online update mechanism to enable periodic, globally aware routing adjustments. Contribution/Results: Evaluated on real LLM training workloads, the framework reduces average communication latency by 18–32% and improves GPU effective utilization by up to 24%, significantly accelerating training convergence. Its core contribution lies in deeply embedding ML training traffic semantics into network-layer design—enabling communication efficiency to be quantifiable, optimizable, and practically deployable.

Technology Category

Application Category

📝 Abstract

Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training process exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for extit{quantifying} network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically extit{optimizing} routing with respect to this global metric.

Problem

Research questions and friction points this paper is trying to address.

Optimizing data routing for large ML models

Quantifying network efficiency in LLM training

Improving communication patterns in data centers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithmic framework for network efficiency quantification

Periodic optimization of routing for large ML models

Leverages regularity in communication patterns for optimization

🔎 Similar Papers

No similar papers found.