Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In vision autoregressive Transformers, the KV cache required for sequential generation is a fundamental memory bottleneck. Method: We formally define the KV-cache compression problem and prove that, for token dimension d = Ω(log n), any sequential generation mechanism under standard attention admits an Ω(n²d) lower bound on KV cache memory. The proof framework leverages a reduction via random embeddings, jointly integrating dimensionality reduction and attention modeling to analyze how structural priors (e.g., sparsity, locality) affect memory efficiency. Contribution/Results: We establish the impossibility of subquadratic compression under standard attention, showing that breaking this lower bound necessitates explicit structural assumptions such as sparsity or local connectivity. This provides a foundational theoretical basis for designing memory-efficient vision generative models.
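The n²d quantity in the summary can be made concrete by counting key/value traffic in plain autoregressive decoding. The sketch below is not the paper's formal model; n and d are illustrative values, and it simply tallies how many cached key/value vectors are held and read as generation proceeds.

```python
# Minimal sketch (not the paper's construction): count cache size and
# key/value reads for a full-KV-cache autoregressive decode.
n, d = 1024, 64  # tokens generated, embedding dimension (illustrative)

cache_entries = 0    # number of cached key/value vectors after each step
total_accesses = 0   # key/value vector reads summed over all steps
for step in range(1, n + 1):
    cache_entries = step        # step i holds all i generated tokens
    total_accesses += step      # attention at step i reads all i entries

peak_memory = cache_entries * d     # floats held at once: n * d
total_traffic = total_accesses * d  # n(n+1)/2 * d = Theta(n^2 * d) reads
```

The peak cache holds n·d floats, while the attention reads accumulated over all n steps total n(n+1)/2 · d, which is where the quadratic Θ(n²d) scale appears.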

📝 Abstract
A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
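The "randomized embedding techniques inspired by dimensionality reduction principles" mentioned in the abstract are in the spirit of Johnson-Lindenstrauss random projections. The sketch below is illustrative only and is not the paper's construction: a scaled Gaussian projection maps n points down to a target dimension on the order of log n while approximately preserving pairwise distances, with all sizes chosen for demonstration.

```python
import numpy as np

# Illustrative Johnson-Lindenstrauss-style random embedding (not the
# paper's construction): project n points from R^D to R^k and check that
# pairwise squared distances are approximately preserved.
rng = np.random.default_rng(0)
n, D, k = 100, 512, 256          # points, original dim, target dim (illustrative)

X = rng.normal(size=(n, D))               # n "token embeddings" in R^D
P = rng.normal(size=(D, k)) / np.sqrt(k)  # scaled Gaussian projection
Y = X @ P                                 # embedded points in R^k

# Relative distortion of squared pairwise distances under the projection.
idx = np.triu_indices(n, 1)
orig = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)[idx]
proj = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)[idx]
distortion = np.abs(proj / orig - 1)
```

With k = 256, the typical relative distortion per pair concentrates around √(2/k) ≈ 0.09, which is the concentration behavior such reductions rely on.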
Problem

Research questions and friction points this paper is trying to address.

Addresses memory overhead in Visual Autoregressive Transformers.
Formalizes KV-cache compression problem for visual token generation.
Proves sub-quadratic memory usage is impossible without constraints.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes KV-cache compression in visual transformers
Proves Ω(n²d) memory lower bound for token generation
Explores sparsity priors for memory efficiency improvements
Bo Chen
Middle Tennessee State University
Xiaoyu Li
Stevens Institute of Technology
Yekun Ke
Independent Researcher
Yingyu Liang
The University of Hong Kong
Zhenmei Shi
Senior Research Scientist at MongoDB + Voyage AI; PhD from University of Wisconsin–Madison
Zhao Song
Simons Institute for the Theory of Computing, UC Berkeley