Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training

📅 2026-02-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of efficiently overlapping communication and computation in distributed large language model training, particularly when computation becomes the bottleneck. The authors propose Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. It combines a unified cost model with a priority-based search algorithm, reducing the optimization complexity from exponential to linear. The approach is evaluated across diverse model architectures and parallelization strategies. On both high- and low-bandwidth GPU clusters, it achieves training speedups of 1.03× to 1.33× over the NCCL and AutoCCL baselines.

📝 Abstract
Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it, especially when computation becomes the bottleneck, remains challenging. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL, respectively, across diverse models and parallelizations.
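
The abstract does not spell out the cost model or the search procedure, so the following is only a toy sketch of the general idea, not Lagom's actual algorithm. It assumes a single hypothetical tunable (`channels`) per overlapped operation and a made-up cost model in which more channels speed up communication but slow the concurrent computation through contention. All names (`CHANNEL_CHOICES`, `op_time`, `priority_search`, the fields of `OPS`) are illustrative assumptions. It contrasts an exhaustive search over the joint configuration space with a priority-ordered search that tunes one operation at a time.

```python
from itertools import product

# Hypothetical candidate values for one tunable communication parameter.
CHANNEL_CHOICES = [1, 2, 4, 8]

def op_time(op, channels):
    """Toy cost model (an assumption, not the paper's): communication gets
    faster with more channels, while the overlapped computation gets slower."""
    comm = op["comm_work"] / channels
    comp = op["comp_work"] * (1.0 + op["contention"] * channels)
    return max(comm, comp)  # when overlapped, the slower side dominates

def iteration_time(ops, config):
    return sum(op_time(op, c) for op, c in zip(ops, config))

def exhaustive_search(ops):
    """Try every joint configuration: k**n cost-model evaluations."""
    best_time, best_config, evaluations = None, None, 0
    for config in product(CHANNEL_CHOICES, repeat=len(ops)):
        evaluations += 1
        t = iteration_time(ops, config)
        if best_time is None or t < best_time:
            best_time, best_config = t, config
    return best_config, best_time, evaluations

def priority_search(ops):
    """Tune one operation at a time, slowest first: n * k evaluations."""
    config = [CHANNEL_CHOICES[0]] * len(ops)
    order = sorted(range(len(ops)),
                   key=lambda i: op_time(ops[i], config[i]), reverse=True)
    evaluations = 0
    for i in order:
        best_c, best_t = config[i], None
        for c in CHANNEL_CHOICES:
            trial = config[:i] + [c] + config[i + 1:]
            evaluations += 1
            t = iteration_time(ops, trial)
            if best_t is None or t < best_t:
                best_c, best_t = c, t
        config[i] = best_c
    return tuple(config), iteration_time(ops, config), evaluations

# Made-up workload: the second operation is computation-bound.
OPS = [
    {"comm_work": 8.0, "comp_work": 1.0, "contention": 0.05},
    {"comm_work": 2.0, "comp_work": 3.0, "contention": 0.10},
    {"comm_work": 6.0, "comp_work": 2.0, "contention": 0.02},
]

if __name__ == "__main__":
    print(exhaustive_search(OPS))
    print(priority_search(OPS))
```

With three operations and four candidates each, the exhaustive search evaluates the cost model 64 times and the priority-ordered search 12 times. Both arrive at the same configuration here only because this toy cost model is separable across operations; with interacting operations a one-at-a-time search is a heuristic and is not guaranteed to match the exhaustive optimum. The computation-bound operation ends up with the fewest channels, which mirrors the abstract's point that communication settings should give way when computation is the bottleneck.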
Problem

Research questions and friction points this paper is trying to address.

communication-computation overlap
distributed LLM training
computation bottleneck
resource balancing
training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

communication-computation overlap
distributed LLM training
cost model
priority-based search
system optimization