🤖 AI Summary
To address three key bottlenecks in distributed large language model (LLM) training (high communication overhead during gradient synchronization, the large memory footprint of optimizer states, and poor adaptability to heterogeneous hardware), this paper proposes ACCO, a scalable distributed optimization algorithm. ACCO introduces three core innovations: (1) a computation-communication overlap mechanism that hides communication latency behind gradient computation; (2) sharding of optimizer states across workers, which reduces per-GPU memory consumption; and (3) an integrated design combining gradient accumulation during communication, pipelined communication scheduling, and compensation for the resulting one-step delay, which enables stable training without warmup steps and matches the training dynamics of standard distributed optimization. Experiments on several LLM training and fine-tuning tasks show that ACCO converges faster in wall-clock time, and the method accommodates heterogeneous hardware.
📝 Abstract
Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data-parallel settings induces a communication overhead that grows with the number of distributed workers, which can impede the efficiency gains of parallelization. To address this challenge, optimization algorithms that reduce inter-worker communication have emerged, such as the local optimization methods used in Federated Learning. While effective in minimizing communication overhead, these methods incur significant memory costs, hindering scalability: in addition to extra momentum variables, if communications are only allowed between multiple local optimization steps, then the optimizer's states cannot be sharded among workers. In response, we propose $\textbf{AC}$cumulate while $\textbf{CO}$mmunicate ($\texttt{ACCO}$), a memory-efficient optimization algorithm tailored for distributed training of LLMs. $\texttt{ACCO}$ allows optimizer states to be sharded across workers, overlaps gradient computations and communications to conceal communication costs, and accommodates heterogeneous hardware. Our method relies on a novel technique to mitigate the one-step delay inherent in the parallel execution of gradient computations and communications, eliminating the need for warmup steps and aligning with the training dynamics of standard distributed optimization while converging faster in terms of wall-clock time. We demonstrate the effectiveness of $\texttt{ACCO}$ on several LLM training and fine-tuning tasks.
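
To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how much optimizer state one worker holds with and without sharding. It is not taken from the paper: the function name `optimizer_state_bytes_per_worker` is hypothetical, and the defaults assume an Adam-style optimizer that keeps two fp32 buffers per parameter (8 bytes per parameter), ignoring gradients, weights, and any extra momentum variables.

```python
def optimizer_state_bytes_per_worker(num_params, num_workers, sharded,
                                     states_per_param=2, bytes_per_value=4):
    """Estimate the optimizer-state memory held by one worker, in bytes.

    Assumptions (illustrative, not from the paper): an Adam-style optimizer
    with `states_per_param` buffers per parameter, each stored in
    `bytes_per_value` bytes (4 for fp32).
    """
    if num_params < 0 or num_workers < 1:
        raise ValueError("num_params must be >= 0 and num_workers >= 1")

    if not sharded:
        # Every worker keeps a full copy of the optimizer state.
        return num_params * states_per_param * bytes_per_value

    # Sharded: each worker keeps the state for roughly 1/num_workers of the
    # parameters. Ceiling division reports the largest shard.
    params_per_worker = -(-num_params // num_workers)
    return params_per_worker * states_per_param * bytes_per_value


if __name__ == "__main__":
    n_params = 1_000_000_000  # hypothetical 1B-parameter model
    for workers in (1, 8, 64):
        full = optimizer_state_bytes_per_worker(n_params, workers, sharded=False)
        shard = optimizer_state_bytes_per_worker(n_params, workers, sharded=True)
        print(f"{workers:>3} workers: replicated {full / 1e9:.2f} GB, "
              f"sharded {shard / 1e9:.3f} GB per worker")
```

Under these assumptions, the replicated cost per worker stays constant as workers are added, while the sharded cost shrinks roughly in proportion to the number of workers. This is the property that the abstract says local optimization methods give up.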