Rate-Distortion Optimization for Transformer Inference

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and communication overhead of Transformer inference in cross-device deployment by introducing rate-distortion theory into intermediate-representation compression, a first in this domain. The authors propose a lossy compression framework that explicitly trades off bitrate against model accuracy. Grounded in an information-theoretic perspective, they develop an analytical framework and derive PAC-style generalization bounds on the gap between rate and entropy. Experiments on language benchmarks show that the method substantially reduces communication cost, in some cases even improving accuracy, and consistently outperforms more sophisticated baselines, supporting the practical relevance of the theoretical bounds.
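The trade-off described in the summary can be written compactly. A hedged sketch in standard rate-distortion notation (the symbols D, R, H, λ, and θ are conventional choices here, not taken from the paper's own definitions):

```latex
% Lagrangian rate-distortion objective: minimize distortion D(theta) plus a
% lambda-weighted bitrate R(theta) of the compressed intermediate representation.
\min_{\theta} \; D(\theta) + \lambda \, R(\theta)

% "Rate-entropy gap": the excess of the achieved rate over the source entropy H.
% The PAC-style bounds mentioned above concern estimates of this gap.
\Delta(\theta) = R(\theta) - H
```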

📝 Abstract
Transformers achieve superior performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning the process across multiple devices, which in turn requires compressing the intermediate representations exchanged between them. In this work, we introduce a principled rate-distortion framework for lossy compression that learns compact encodings which explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial rate savings, with improved accuracy in some cases, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of Transformers, offering a unified lens for understanding representation coding. The formulation extends information-theoretic concepts to define the gap between rate and entropy and to derive bounds on it. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. Across architectures and tasks, we empirically demonstrate that the observed rates are governed by these bounds, adding to the explainability of the formulation.
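To make the bitrate-accuracy trade-off concrete, here is a minimal sketch of a rate-distortion objective for intermediate activations. This is not the paper's learned codec: the learned encoder is replaced by a simple fixed-step uniform quantizer, the rate is estimated as the empirical entropy of the quantized symbols, and the function name `rd_objective` and parameters `step` and `lam` are illustrative choices.

```python
import numpy as np

def rd_objective(activations, step, lam):
    """Toy rate-distortion objective D + lam * R for a tensor of activations.

    Assumption: a fixed-step uniform scalar quantizer stands in for the
    learned compact encoding described in the abstract.
    """
    # Quantize to integer indices, then reconstruct.
    q = np.round(activations / step)
    rec = q * step
    # Distortion: mean squared reconstruction error.
    distortion = np.mean((activations - rec) ** 2)
    # Rate: empirical entropy (bits per symbol) of the quantized indices,
    # a proxy for the bitrate an entropy coder would achieve.
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    rate = -np.sum(p * np.log2(p))
    return distortion + lam * rate, distortion, rate
```

Sweeping `step` traces a rate-distortion curve: a finer step lowers distortion but raises the rate, and `lam` selects the operating point on that curve.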
Problem

Research questions and friction points this paper is trying to address.

Rate-Distortion
Transformer Inference
Lossy Compression
Intermediate Representations
Bitrate-Accuracy Tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rate-Distortion Optimization
Transformer Compression
Lossy Compression
Information Theory
PAC Bounds
Anderson de Andrade
Simon Fraser University
Machine Learning, Signal Processing
Alon Harell
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
Ivan V. Bajić
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada