Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently overlapping communication and computation in distributed large language model training, particularly when computation becomes the bottleneck. The authors propose a communication-computation co-optimization mechanism that combines a unified cost model with a priority-based search algorithm, reducing the optimization complexity from exponential to linear. The approach generalizes across diverse model architectures and parallelization strategies. Experiments on both high- and low-bandwidth GPU clusters show consistent training speedups of 1.03× to 1.33× over the state-of-the-art baselines NCCL and AutoCCL.

📝 Abstract
Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it, especially when computation becomes the bottleneck, remains challenging. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves speedups of 1.07-1.33× over NCCL and 1.03-1.27× over AutoCCL across diverse models and parallelizations.
Problem

Research questions and friction points this paper is trying to address.

communication-computation overlap
distributed LLM training
computation bottleneck
resource balancing
training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

communication-computation overlap
distributed LLM training
cost model
priority-based search
system optimization
Guanbin Xu
University of Science and Technology of China
ZhenGuo Xu
University of Science and Technology of China
Yuzhe Li
University of Science and Technology of China
Youhui Bai
University of Science and Technology of China
Ping Gong
USTC
AI System
Chaoyi Ruan
National University of Singapore
Cheng Li
University of Science and Technology of China
AI/Storage Systems, Operating Systems, Distributed Systems