🤖 AI Summary
The quadratic computational complexity of Transformer attention severely hinders efficient inference for large language models with long contexts. To address this, we propose AQUA, a method that dynamically selects sparse attention dimensions based on query vector magnitudes and applies lightweight, language-agnostic query-key projections learned offline via SVD. Crucially, AQUA directly reduces KV cache size, a saving that token-level compression alone does not provide. It offers tunable efficiency–accuracy trade-offs and integrates seamlessly with existing token-compression techniques (e.g., H2O). Evaluated on Llama-3.1-8B, AQUA reduces attention computation by 25% with no statistically significant performance degradation, while substantially compressing the KV cache and accelerating downstream compression schemes. These results demonstrate AQUA's effectiveness, its generality across architectures and tasks, and its practical deployability in real-world long-context inference.
📝 Abstract
The quadratic complexity of the attention mechanism remains a fundamental barrier to scaling Large Language Models (LLMs) to longer contexts, creating a critical bottleneck in both computation and memory. To address this, we introduce AQUA (Attention via QUery mAgnitudes), a novel and versatile approximation strategy that significantly reduces the cost of attention with a graceful performance trade-off. Our method operates in two phases: an efficient offline step, where we compute a universal, language-agnostic projection matrix via SVD on a calibration dataset, and an online inference step, where we project query and key vectors and dynamically select a sparse subset of dimensions based on the query's magnitude. We provide a formal theoretical analysis of AQUA, establishing the break-even point at which it becomes more computationally efficient than standard attention. Our empirical evaluations on state-of-the-art models like Llama-3.1-8B demonstrate that a 25% reduction in the attention dot-product computation can be achieved with a statistically insignificant impact on performance across a wide range of benchmarks. We further showcase the versatility of AQUA by demonstrating its ability to synergistically accelerate existing token eviction methods like H2O and to directly reduce KV-cache memory size. By offering a controllable knob to balance efficiency and accuracy, AQUA provides a practical and powerful tool for making large-scale LLM inference more accessible and sustainable.
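To make the two-phase procedure concrete, below is a minimal NumPy sketch of the idea as described in the abstract: an offline SVD on calibration queries yields an orthonormal projection, and at inference time only the dimensions where the projected query has the largest magnitude are used in the dot product. Function names, the `keep_ratio` knob, and the choice to calibrate on query vectors alone are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fit_projection(calib_queries):
    """Offline phase (sketch): SVD on a calibration matrix of query
    vectors (shape N x d) gives an orthonormal basis used as the
    universal projection matrix."""
    _, _, Vt = np.linalg.svd(calib_queries, full_matrices=False)
    return Vt.T  # (d, d), columns are right singular vectors

def approx_scores(q, K, P, keep_ratio=0.75):
    """Online phase (sketch): project query and keys, then keep only
    the dimensions where |projected query| is largest, approximating
    the full dot product q . k with a sparse one."""
    qp = q @ P                            # projected query, (d,)
    Kp = K @ P                            # projected keys, (n, d)
    k = int(keep_ratio * qp.shape[0])     # tunable efficiency/accuracy knob
    idx = np.argsort(-np.abs(qp))[:k]     # top-magnitude dimensions of q
    return Kp[:, idx] @ qp[idx]           # sparse approximate scores, (n,)
```

Because the projection is orthonormal, `keep_ratio=1.0` recovers the exact scores, and lowering it trades accuracy for fewer multiply-adds per query-key pair, which is the controllable knob the abstract refers to.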