TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high communication overhead, significant compression errors, and heavy computational burden associated with intermediate tensors in large-scale tensor-parallel training of large language models. To tackle these challenges, the authors propose TACO, a novel framework that integrates data-driven tensor reordering with an adaptive scaled Hadamard transform to enable efficient, high-fidelity FP8 quantization. TACO employs a dual-scale mechanism to ensure numerical stability during training and introduces a highly fused compression operator to reduce memory traffic and kernel launch overhead. Seamlessly integrated into 3D parallel training pipelines, TACO achieves up to 1.87× end-to-end throughput improvement on GPT and Qwen models while maintaining near-lossless model accuracy.

📝 Abstract
Communication overhead remains a critical challenge in large-scale tensor-parallel (TP) training: intermediate tensors have dense, near-zero distributions, which amplify compression errors under frequent communication and make compression itself computationally expensive. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we employ a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while a Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator that reduces memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods for Data and Pipeline Parallelism to build a compression-enabled 3D-parallel training framework. Experiments on GPT and Qwen models demonstrate up to 1.87× end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.
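
The general idea of rotating a tensor with a Hadamard transform before low-precision quantization can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not TACO's implementation: it uses a fixed (non-adaptive) orthonormal Hadamard matrix, a single per-block scale rather than the paper's Dual-Scale mechanism, and a software emulation of FP8 E4M3 rounding. The names `hadamard`, `fake_fp8_e4m3`, `compress`, `decompress`, and the block size of 16 are all hypothetical choices made for this example.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def fake_fp8_e4m3(x):
    """Emulate E4M3 rounding in float64 (assumption: round-to-nearest, saturating)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, e = np.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)         # below the smallest normal, use the subnormal step
    step = np.exp2(e - 4.0)       # 3 mantissa bits -> spacing of 2**(e - 4)
    return np.round(x / step) * step


def compress(x, block=16):
    """Rotate each block with a Hadamard matrix, then quantize with one scale per block."""
    h = hadamard(block)
    blocks = x.reshape(-1, block) @ h
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero on all-zero blocks
    return fake_fp8_e4m3(blocks / scale), scale


def decompress(q, scale, shape, block=16):
    """Undo the scaling and the rotation (the inverse of an orthonormal matrix is its transpose)."""
    h = hadamard(block)
    return ((q * scale) @ h.T).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(scale=1e-3, size=(8, 64))   # dense, near-zero values
    q, s = compress(x)
    y = decompress(q, s, x.shape)
    print("relative error:", np.linalg.norm(y - x) / np.linalg.norm(x))
```

Because the rotation is orthonormal, it preserves the norm of each block while spreading any single large entry across the whole block, which is the usual motivation for pairing it with a coarse number format. In a real system the quantized values and scales would be what gets sent over the network; here they stay in float64 for readability.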
Problem

Research questions and friction points this paper is trying to address.

communication compression
tensor-parallel training
intermediate tensors
large language models
communication overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

communication compression
FP8 quantization
tensor-parallel training
adaptive scale-Hadamard transform
3D-parallelism
Man Liu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Xingchen Liu
Institute of Computing Technology, Chinese Academy of Sciences
Xingjian Tian
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Bing Lu
Simon Fraser University
Remote Sensing, Environmental Change, Vegetation Ecology, UAV
Shengkai Lyu
Institute of Computing Technology, Chinese Academy of Sciences
Shengquan Yin
University of Science and Technology of China
Wenjing Huang
RAND Corporation
Psychometrics, Structural Equation Modeling, Item Response Theory, Cyber Security
Zheng Wei
Institute of Computing Technology, Chinese Academy of Sciences
Hairui Zhao
Institute of Computing Technology, Chinese Academy of Sciences
Guangming Tan
Institute of Computing Technology, Chinese Academy of Sciences
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member
High Performance Computing, Data Reduction, Deep Learning, Systems for ML, GPU