Tensor-Parallelism with Partially Synchronized Activations

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the communication bandwidth bottleneck caused by activation synchronization in tensor-parallel training and inference of large language models (LLMs), this paper proposes CAAT-Net, a communication-aware tensor-parallel architecture. Its core idea is partial activation synchronization: selected layers skip cross-device activation synchronization while preserving gradient consistency and training stability. This substantially reduces inter-device communication, cutting tensor-parallel communication overhead by 50% for both 1B and 7B models, without noticeable degradation in pretraining loss or downstream task accuracy. Training and inference throughput improve by approximately 1.8× and 1.6×, respectively. CAAT-Net offers a lightweight, practical communication-optimization approach for efficient distributed training of large-scale LLMs.

📝 Abstract
Training and inference of Large Language Models (LLMs) with tensor-parallelism requires substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train 1B and 7B parameter CAAT-Net models, with a 50% reduction in tensor-parallel communication and no significant drop in pretraining accuracy. Furthermore, we demonstrate how CAAT-Net accelerates both training and inference workloads.
Problem

Research questions and friction points this paper is trying to address.

High communication bandwidth demanded by activation synchronization in tensor-parallel LLM training and inference
Whether LLMs can be trained without fully synchronizing activations, and at what accuracy cost
How reduced synchronization translates into faster training and inference workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partially synchronized activations reduce bandwidth
CAAT-Net cuts tensor-parallel communication by 50%
Maintains accuracy while accelerating training and inference
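The skip-the-synchronization idea behind these bullets can be illustrated with a minimal NumPy sketch that simulates two tensor-parallel shards on one process. The shard layout, layer count, and which layers skip the all-reduce (`sync_mask`) are illustrative assumptions for this sketch, not the paper's actual CAAT-Net architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 2   # number of simulated tensor-parallel devices
d = 4            # model width (toy size)

# Two dense "layers"; in tensor parallelism each device holds a row shard of
# the weight and multiplies it by the matching slice of its input, producing
# a partial sum of the layer output that an all-reduce normally completes.
layers = [rng.standard_normal((d, d)) for _ in range(2)]
sync_mask = [False, True]  # hypothetical: skip the all-reduce after layer 0

def shard_rows(W, rank):
    rows = d // world_size
    return W[rank * rows:(rank + 1) * rows, :]

x = rng.standard_normal(d)                    # input replicated on all devices
acts = [x.copy() for _ in range(world_size)]  # per-device activations
comms = 0                                     # count of all-reduce operations

for W, sync in zip(layers, sync_mask):
    partials = []
    for rank in range(world_size):
        rows = d // world_size
        x_slice = acts[rank][rank * rows:(rank + 1) * rows]
        partials.append(x_slice @ shard_rows(W, rank))  # partial output sum
    if sync:
        total = sum(partials)  # all-reduce: every device gets the full sum
        acts = [total.copy() for _ in range(world_size)]
        comms += 1
    else:
        acts = partials        # no communication: devices keep partial sums

print("all-reduces performed:", comms)  # 1 instead of 2, i.e. 50% fewer
```

With one of two synchronizations skipped, the sketch performs half the all-reduces of standard tensor parallelism, mirroring the 50% communication reduction reported for CAAT-Net; the paper's contribution is showing this can be done without a significant accuracy drop.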