Communication-Efficient Multi-Device Inference Acceleration for Transformer Models

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency and excessive inter-device bandwidth requirements of Transformer models in bandwidth-constrained edge environments, this paper proposes ASTRA, a communication-efficient multi-device collaborative inference framework. Methodologically, ASTRA introduces (1) a novel sequence-parallel architecture tightly integrated with mixed-precision attention, and (2) noise-augmented quantization coupled with a distributed class-token mechanism to preserve accuracy under ultra-low bandwidth. These innovations drastically reduce inter-device communication volume, enabling operation at bandwidths as low as 10 Mbps. Experiments on ViT and GPT-2 demonstrate that ASTRA achieves up to 2.64× speedup over single-device inference and outperforms state-of-the-art multi-device approaches by up to 15.25×. ASTRA thus establishes a scalable, low-communication-overhead paradigm for real-time large-model deployment in edge scenarios.
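The noise-augmented quantization idea described above can be sketched in a few lines: during training, additive noise matching the quantizer's error range stands in for the (non-differentiable) quantizer, so the model learns to tolerate the error it will see at inference time. This is a minimal illustrative sketch, not ASTRA's actual training pipeline; the `step` size and uniform-noise model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, step=0.25):
    """Uniform quantizer: round each value to the nearest multiple of `step`."""
    return np.round(x / step) * step

def noise_augment(x, step=0.25, rng=rng):
    """Training-time surrogate: add uniform noise drawn from the quantizer's
    error range U(-step/2, step/2) so the model learns robustness to it."""
    return x + rng.uniform(-step / 2, step / 2, size=x.shape)

x = rng.standard_normal((4, 8))          # stand-in for token embeddings
q_err = np.abs(quantize(x) - x)          # true quantization error
n_err = np.abs(noise_augment(x) - x)     # simulated error during training
# both error magnitudes are bounded by step/2 = 0.125
assert q_err.max() <= 0.125 + 1e-9
assert n_err.max() <= 0.125 + 1e-9
```

Because the surrogate noise has the same range as the real quantization error, gradients flow through training while the network adapts to the perturbation it will face under aggressive compression.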

📝 Abstract
Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference latency in Transformer models
Minimizing inter-device communication in multi-device inference
Maintaining accuracy under bandwidth-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence parallelism for multi-device inference
Mixed-Precision Attention reduces communication
Vector quantization compresses token embeddings
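The bandwidth saving from vector-quantizing token embeddings is easy to see in a toy sketch: each device sends only the index of the nearest codebook centroid instead of a full floating-point embedding. The codebook size (K = 256, so one byte per token) and embedding dimension (d = 64) below are illustrative assumptions, not ASTRA's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical codebook of K centroids in the embedding space.
d, K = 64, 256                      # embedding dim, codebook size (8-bit index)
codebook = rng.standard_normal((K, d)).astype(np.float32)

def vq_encode(tokens, codebook):
    """Map each token embedding to its nearest codebook index (what gets sent)."""
    dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices, codebook):
    """Receiver reconstructs approximate embeddings by codebook lookup."""
    return codebook[indices]

tokens = rng.standard_normal((16, d)).astype(np.float32)
idx = vq_encode(tokens, codebook)
recon = vq_decode(idx, codebook)

raw_bytes = tokens.nbytes           # 16 tokens x 64 dims x 4 bytes = 4096
sent_bytes = idx.nbytes             # 16 tokens x 1 byte = 16
print(f"compression: {raw_bytes // sent_bytes}x")   # prints "compression: 256x"
```

Under these assumed sizes, each fp32 embedding (256 bytes) collapses to a single byte on the wire, which is the kind of reduction that makes 10 Mbps links viable for inter-device exchange.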
Authors
Xiao Liu, University of Massachusetts Amherst, MA 01003-9264
Lijun Zhang, University of Massachusetts Amherst, MA 01003-9264
Deepak Ganesan, Professor of Computer Science, UMass Amherst (Wearable and sensor computing; Mobile systems; Backscatter communication; Mobile health)
Hui Guan, UMass Amherst (Machine Learning Systems)