🤖 AI Summary
To address the quadratic compute cost of multi-head attention (MHA) and the linearly growing key-value (KV) cache that make long-context Transformers expensive to train and serve, this paper proposes Compressed Convolutional Attention (CCA). CCA down-projects queries, keys, and values into a shared low-dimensional latent space and performs the entire attention operation there, cutting parameters, KV-cache, and FLOPs all at once by the chosen compression factor. Because this compression is orthogonal to head sharing, combining CCA with grouped-query-style head sharing yields Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so compression can be tuned toward either FLOP or memory limits. Experiments show that CCGQA consistently outperforms grouped-query attention (GQA) and multi-latent attention (MLA) at equal KV-cache compression on both dense and MoE models; on MoE models it matches standard MHA at 8× KV-cache compression, using half the cache of GQA and MLA. With fused CUDA kernels on H100 GPUs, CCA/CCGQA reduces prefill latency by about 1.7× at a sequence length of 16k relative to MHA and accelerates the backward pass by about 1.3×.
📝 Abstract
Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.
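The core idea described above, projecting queries, keys, and values into one shared latent space and running the whole attention operation at the compressed width, can be sketched as follows. This is a minimal single-head NumPy sketch under simplifying assumptions: the projection names and shapes are illustrative, and the paper's convolutional components, head sharing, and fused kernels are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_attention(x, W_dq, W_dk, W_dv, W_o):
    """Single-head attention computed entirely in a shared latent space.

    All of Q, K, V live at width d_c << d_model, so the QK^T scores and
    the value mixing are both computed at the compressed dimension,
    shrinking the KV-cache and attention FLOPs by d_model / d_c.
    (Hypothetical parameterization for illustration, not the paper's.)
    """
    q = x @ W_dq                                # (seq, d_c)
    k = x @ W_dk                                # (seq, d_c)
    v = x @ W_dv                                # (seq, d_c)
    d_c = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d_c))    # (seq, seq)
    latent_out = scores @ v                     # (seq, d_c), still compressed
    return latent_out @ W_o                     # (seq, d_model), up-projected

rng = np.random.default_rng(0)
d_model, d_c, seq = 64, 16, 8                   # 4x compression factor
x = rng.standard_normal((seq, d_model))
W_dq, W_dk, W_dv = (0.1 * rng.standard_normal((d_model, d_c)) for _ in range(3))
W_o = 0.1 * rng.standard_normal((d_c, d_model))
y = compressed_attention(x, W_dq, W_dk, W_dv, W_o)
print(y.shape)  # (8, 64)
```

Note that during decoding only the latent `k` and `v` rows would need caching, which is where the KV-cache compression factor comes from; the same down-projection also shrinks the score matrix's inner dimension, which is why, unlike cache-only methods such as GQA and MLA, prefill and training FLOPs drop as well.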