🤖 AI Summary
To address the cross-domain communication bottleneck that hinders pipeline-parallel training of large language models (LLMs) in multi-datacenter optical networks, this paper proposes the first communication-aware cross-domain resource allocation framework. The framework jointly models the strong coupling among dynamic optical-network bandwidth, inter-datacenter transmission latency, and computational load, integrating integer linear programming (ILP) optimization, communication–computation co-modeling, and real-time topology-aware scheduling. Experimental results show that, compared to baseline approaches, the framework reduces per-iteration training time by 31.25% and the request blocking rate by 13.20%, while significantly improving training throughput and heterogeneous resource utilization. It establishes a scalable, optical-network-aware co-optimization paradigm for large-scale distributed LLM training.
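The summary couples an ILP placement step with a communication–computation cost model. As a rough illustration of that idea only (not the paper's actual formulation), the sketch below places pipeline stages onto datacenters with PuLP, minimizing a per-iteration-time proxy of compute time plus cross-datacenter activation-transfer time. All stage counts, capacities, bandwidths, and latencies are made-up placeholders.

```python
import pulp  # pip install pulp

stages = range(4)   # pipeline stages, in execution order (illustrative)
dcs = range(3)      # candidate datacenters (illustrative)

# Illustrative per-micro-batch compute time (s) for stage s on DC d.
compute_time = {(s, d): 0.010 + 0.002 * ((s + d) % 3) for s in stages for d in dcs}
# Illustrative activation volume (bytes) sent from stage s to stage s+1.
act_bytes = {s: 64e6 for s in stages}
# Illustrative optical-link bandwidth (bytes/s) and one-way latency (s).
bandwidth = {(a, b): 1.25e9 for a in dcs for b in dcs}   # ~10 Gb/s per path
latency = {(a, b): 0.0 if a == b else 0.005 for a in dcs for b in dcs}

def comm_time(s, a, b):
    """Time to ship stage s's activations from DC a to DC b (0 if same DC)."""
    return 0.0 if a == b else latency[(a, b)] + act_bytes[s] / bandwidth[(a, b)]

prob = pulp.LpProblem("stage_placement", pulp.LpMinimize)

# x[s, d] = 1 iff pipeline stage s is placed in datacenter d.
x = pulp.LpVariable.dicts(
    "x", [(s, d) for s in stages for d in dcs], cat=pulp.LpBinary)
# y[s, a, b] linearizes the product x[s, a] * x[s+1, b]
# (i.e., consecutive stages use the a->b link).
y = pulp.LpVariable.dicts(
    "y", [(s, a, b) for s in stages[:-1] for a in dcs for b in dcs],
    cat=pulp.LpBinary)

for s in stages:                       # each stage lives in exactly one DC
    prob += pulp.lpSum(x[(s, d)] for d in dcs) == 1
for d in dcs:                          # illustrative capacity: <= 2 stages per DC
    prob += pulp.lpSum(x[(s, d)] for s in stages) <= 2
for s in stages[:-1]:                  # standard product linearization
    for a in dcs:
        for b in dcs:
            prob += y[(s, a, b)] >= x[(s, a)] + x[(s + 1, b)] - 1
            prob += y[(s, a, b)] <= x[(s, a)]
            prob += y[(s, a, b)] <= x[(s + 1, b)]

# Objective: a simple per-iteration-time proxy = compute + cross-DC transfer.
prob += (pulp.lpSum(compute_time[(s, d)] * x[(s, d)]
                    for s in stages for d in dcs)
         + pulp.lpSum(comm_time(s, a, b) * y[(s, a, b)]
                      for s in stages[:-1] for a in dcs for b in dcs))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for s in stages:
    d = next(d for d in dcs if pulp.value(x[(s, d)]) > 0.5)
    print(f"stage {s} -> datacenter {d}")
print(f"iteration-time proxy: {pulp.value(prob.objective):.4f} s")
```

The `y` variables are the standard way to keep the model linear when the cost of a consecutive-stage pair depends on both placements at once. The paper's framework would additionally have to track dynamic bandwidth, concurrent training requests, and real-time topology changes, none of which this toy model attempts.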
📝 Abstract
We propose a communication-bound-aware cross-domain resource assignment framework for pipeline-parallel distributed training over multi-datacenter optical networks, which lowers per-iteration time by 31.25% and the request blocking rate by 13.20% compared to baselines.