🤖 AI Summary
In visual autoregressive Transformers, the KV cache kept during inference has memory cost O(n²d), quadratic in the sequence length n, which is a fundamental bottleneck. Method: We formally define the KV-cache compression problem for this setting and prove that, when the embedding dimension satisfies d = Ω(log n), any sequential generation mechanism under standard attention needs Ω(n²d) memory. The proof is a reduction from a computational lower-bound problem that uses randomized embeddings inspired by dimensionality reduction; the same framework is used to analyze how structural priors (e.g., sparsity, locality) affect memory efficiency. Contribution/Results: We establish that subquadratic compression is impossible under standard attention, so breaking the lower bound requires explicit structural assumptions such as sparsity or local connectivity. This provides a theoretical basis for designing memory-efficient vision generative models.
📝 Abstract
A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
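The abstract does not spell out the randomized embedding it relies on. As a rough illustration of why a dimension of order log n is a natural threshold in arguments of this kind, the sketch below checks a standard Johnson-Lindenstrauss-style fact: n random sign vectors in d = O(log n) dimensions are pairwise nearly orthogonal with high probability. This is not the paper's construction; the function names, the constant 16, and the tolerance `t` are assumptions chosen for the example.

```python
import math
import random
from typing import List


def embedding_dim(n: int, t: float) -> int:
    """Hypothetical choice d = ceil(16 ln(n) / t^2), i.e. d = O(log n) for fixed t."""
    return math.ceil(16 * math.log(n) / (t * t))


def random_sign_embeddings(n: int, d: int, seed: int = 0) -> List[List[float]]:
    """Return n vectors in R^d with i.i.d. entries +1/sqrt(d) or -1/sqrt(d)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]


def max_cross_inner_product(vectors: List[List[float]]) -> float:
    """Largest |<u, v>| over all pairs of distinct vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


# Why d = O(log n) is enough (Hoeffding plus a union bound):
# for a fixed pair, <u, v> is a sum of d independent terms equal to +1/d or -1/d,
# so P(|<u, v>| >= t) <= 2 * exp(-t^2 * d / 2). With d >= 16 ln(n) / t^2 this is
# at most 2 * n^-8, and summing over fewer than n^2 / 2 pairs gives a failure
# probability below n^-6.

if __name__ == "__main__":
    n, t = 64, 0.5
    d = embedding_dim(n, t)
    vectors = random_sign_embeddings(n, d, seed=0)
    print("n =", n, "d =", d)
    print("max |<u, v>| over distinct pairs:", max_cross_inner_product(vectors))
```

Every vector here has unit norm, so the largest cross inner product measures how far the set is from orthonormal. The point is only that logarithmic dimension already leaves room for n well-separated directions, which is the regime $d = \Omega(\log n)$ in which the lower bound is stated.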