CBA: Communication-Bound-Aware Cross-Domain Resource Assignment for Pipeline-Parallel Distributed LLM Training in Dynamic Multi-DC Optical Networks

📅 2025-12-23
🤖 AI Summary
Cross-domain communication is a bottleneck for pipeline-parallel training of large language models (LLMs) over multi-datacenter optical networks. To address it, this paper proposes a communication-bound-aware cross-domain resource assignment framework. The framework jointly models the strong coupling among dynamic optical network bandwidth, inter-datacenter transmission latency, and computational load, combining integer linear programming optimization, communication-computation co-modeling, and real-time topology-aware scheduling. Experimental results show that, compared to baseline approaches, the framework reduces per-iteration training time by 31.25% and the request blocking rate by 13.20%, while improving training throughput and heterogeneous resource utilization. It offers a scalable, optical-network-aware co-optimization approach for large-scale distributed LLM training.
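The coupling between link bandwidth, inter-datacenter latency, and compute load can be illustrated with a small Python sketch. It is not the paper's model: it assumes a simple linear pipeline in which transfers overlap with computation, so the iteration time is the latency of the first micro-batch plus one bottleneck interval for each remaining micro-batch. All function and parameter names here are hypothetical.

```python
from typing import List, Sequence


def transfer_times(latency_s: Sequence[float],
                   bandwidth_gbps: Sequence[float],
                   activation_gbit: float) -> List[float]:
    """Time to move one micro-batch's activations over each inter-stage link."""
    return [lat + activation_gbit / bw
            for lat, bw in zip(latency_s, bandwidth_gbps)]


def iteration_time(compute_s: Sequence[float],
                   latency_s: Sequence[float],
                   bandwidth_gbps: Sequence[float],
                   activation_gbit: float,
                   num_microbatches: int) -> float:
    """Estimate per-iteration time under the assumed linear-pipeline model.

    compute_s[i] is the per-micro-batch compute time of stage i; link i
    connects stage i to stage i + 1.
    """
    if len(latency_s) != len(compute_s) - 1 or len(bandwidth_gbps) != len(latency_s):
        raise ValueError("expected one link between each pair of adjacent stages")
    comm = transfer_times(latency_s, bandwidth_gbps, activation_gbit)
    # The first micro-batch traverses every stage and every link in sequence.
    first = sum(compute_s) + sum(comm)
    # Each later micro-batch finishes one bottleneck interval after the previous one.
    bottleneck = max(list(compute_s) + comm)
    return first + (num_microbatches - 1) * bottleneck


def is_communication_bound(compute_s: Sequence[float],
                           latency_s: Sequence[float],
                           bandwidth_gbps: Sequence[float],
                           activation_gbit: float) -> bool:
    """True when the slowest link transfer exceeds the slowest stage compute time."""
    comm = transfer_times(latency_s, bandwidth_gbps, activation_gbit)
    return bool(comm) and max(comm) > max(compute_s)
```

Under this assumed model, lowering the bandwidth of a single cross-domain link can make that link the bottleneck interval, which is the kind of situation a communication-bound-aware assignment would try to avoid when placing pipeline stages.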

📝 Abstract
We propose a communication-bound-aware cross-domain resource assignment framework for pipeline-parallel distributed training over multi-datacenter optical networks. Compared to baselines, it lowers iteration time by 31.25% and reduces blocked requests by 13.20%.
Problem

Research questions and friction points this paper is trying to address.

Optimizes cross-domain resource assignment for distributed LLM training
Reduces communication bottlenecks in multi-datacenter optical networks
Improves training efficiency by reducing iteration time and the number of blocked requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Communication-bound-aware cross-domain resource assignment framework
Pipeline-parallel distributed LLM training optimization
Dynamic multi-datacenter optical network resource allocation
Dianxuan Fu
State Key Laboratory of Photonics and Communications, School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Xiaomin Liu
State Key Laboratory of Photonics and Communications, School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Yihao Zhang
Peking University
AI Safety, Formal Method, Mechanistic Interpretability
Shikui Shen
China Unicom Research Institute, Beijing, 100048, China
Weisheng Hu
State Key Laboratory of Photonics and Communications, School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Qunbi Zhuge
State Key Laboratory of Photonics and Communications, School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China