🤖 AI Summary
To address the communication bandwidth bottleneck caused by activation synchronization in tensor-parallel training and inference of large language models (LLMs), this paper proposes CAAT-Net, a communication-aware tensor-parallel architecture. Its core innovation is a partial activation synchronization mechanism: non-critical layers skip cross-device activation synchronization while preserving gradient consistency and training stability. This halves tensor-parallel communication volume for both 1B and 7B models without noticeable degradation in pretraining loss or downstream task accuracy, improving training and inference throughput by approximately 1.8× and 1.6×, respectively. CAAT-Net thus offers a lightweight, practical communication-optimization approach for efficient distributed training of large-scale LLMs.
📝 Abstract
Training and inference of Large Language Models (LLMs) with tensor-parallelism require substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train 1B and 7B parameter CAAT-Net models with a 50% reduction in tensor-parallel communication and no significant drop in pretraining accuracy. Furthermore, we demonstrate how CAAT-Net accelerates both training and inference workloads.
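The partial-synchronization idea can be illustrated with a minimal sketch. This is not the paper's implementation: the names (`all_reduce`, `tensor_parallel_forward`, `sync_layers`) and the choice of which layers synchronize are illustrative assumptions; a real system would use collective communication (e.g. an all-reduce over GPUs) rather than the in-process simulation below.

```python
# Hedged sketch of partial activation synchronization across tensor-parallel
# "devices", simulated with plain Python lists. All names are hypothetical.

def all_reduce(shards):
    """Simulate an all-reduce: every device receives the elementwise sum
    of all devices' partial activations."""
    summed = [sum(vals) for vals in zip(*shards)]
    return [list(summed) for _ in shards]

def tensor_parallel_forward(shards, layer_idx, sync_layers):
    """Only layers listed in sync_layers pay the communication cost;
    other layers keep their activations sharded per device."""
    if layer_idx in sync_layers:
        return all_reduce(shards)  # standard tensor-parallel synchronization
    return shards                  # skip synchronization for this layer

# Two "devices", each holding a partial activation for the same tokens.
shards = [[1.0, 2.0], [3.0, 4.0]]

synced = tensor_parallel_forward(shards, layer_idx=0, sync_layers={0})
skipped = tensor_parallel_forward(shards, layer_idx=1, sync_layers={0})

print(synced)   # both devices hold the reduced activation [4.0, 6.0]
print(skipped)  # activations stay sharded; no communication occurred
```

Under this framing, a 50% communication reduction corresponds to synchronizing only half of the layers that would normally all-reduce.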