Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the decoding latency caused by the quadratic computational complexity of self-attention in long-context scenarios, this paper proposes Adamas, a sparse attention mechanism that constructs compact key-value (KV) representations via the Hadamard transform, bucketization, and 2-bit quantization, and performs efficient top-k retrieval using a Manhattan-distance approximation. Unlike heuristic sparsification methods, Adamas substantially improves the recall of critical KV pairs under extremely low KV cache budgets while preserving accuracy. Experiments on 32K-length sequences demonstrate a 4.4× speedup in self-attention computation and a 1.5× end-to-end inference speedup. Moreover, Adamas supports up to 8× higher sparsity than state-of-the-art methods while matching or even reducing perplexity, striking a strong balance between efficiency and modeling fidelity.
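For intuition, here is a minimal sketch of the retrieval path described above: keys and the query are passed through a fast Walsh-Hadamard transform, each coordinate is bucketized into one of four levels (a 2-bit code), and cached keys are ranked by the Manhattan distance between codes. The function names, fixed bucket thresholds, and NumPy implementation are assumptions for illustration only; the paper's actual kernels, calibration, and quantization details may differ.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two (true for typical head dims).
    """
    orig_shape = x.shape
    d = orig_shape[-1]
    y = x.reshape(-1, d).astype(np.float32).copy()
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = y[:, start:start + h].copy()
            b = y[:, start + h:start + 2 * h].copy()
            y[:, start:start + h] = a + b
            y[:, start + h:start + 2 * h] = a - b
        h *= 2
    return (y / np.sqrt(d)).reshape(orig_shape)

def bucketize_2bit(x, thresholds=(-0.5, 0.0, 0.5)):
    """Map each transformed coordinate to one of four buckets (a 2-bit code).

    The fixed thresholds are placeholders; the paper would calibrate bucket
    boundaries rather than hard-code them.
    """
    return np.digitize(x, thresholds).astype(np.uint8)  # values in {0, 1, 2, 3}

def topk_by_manhattan(q, keys, k):
    """Select the k cached keys whose 2-bit codes are closest to the query's code
    in Manhattan distance, as a cheap proxy for full attention scores."""
    q_code = bucketize_2bit(fwht(q))      # (d,)
    k_codes = bucketize_2bit(fwht(keys))  # (n, d)
    dists = np.abs(k_codes.astype(np.int16) - q_code.astype(np.int16)).sum(axis=-1)
    return np.argsort(dists)[:k]          # indices of the retained keys

# Toy usage: retrieve a 64-token budget out of a 32K-token cache for one head.
rng = np.random.default_rng(0)
keys = rng.standard_normal((32_768, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
selected = topk_by_manhattan(query, keys, k=64)
print(selected.shape)  # (64,)
```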

📝 Abstract
Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
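One concrete aspect of the compact representation is the 2-bit compression mentioned in the abstract. The snippet below sketches how four 2-bit bucket codes could be packed into a single byte; pack_2bit and unpack_2bit are hypothetical helpers and not the paper's actual storage layout.

```python
import numpy as np

def pack_2bit(codes):
    """Pack four 2-bit bucket codes into one byte (4x smaller than one code per byte)."""
    flat = codes.reshape(-1, 4).astype(np.uint8)
    return (flat[:, 0]
            | (flat[:, 1] << 2)
            | (flat[:, 2] << 4)
            | (flat[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, last_dim):
    """Recover the 2-bit codes from the packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1, last_dim)

# A 128-dim code vector occupies 32 bytes after packing.
codes = np.random.default_rng(1).integers(0, 4, size=(16, 128), dtype=np.uint8)
packed = pack_2bit(codes)
assert np.array_equal(unpack_2bit(packed, 128), codes)
print(packed.nbytes, codes.nbytes)  # 512 vs 2048 bytes
```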
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic cost of self-attention in long contexts
Improves recall of critical key-value pairs in sparse attention
Maintains accuracy while enabling aggressive sparsity for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hadamard transform for compact attention representation
Manhattan-distance estimation for efficient top-k selection
Bucketization and 2-bit compression for sparse attention (a usage sketch follows below)
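Putting the listed pieces together: once a small set of key indices has been retrieved (for example by the Manhattan-distance scoring sketched earlier), decoding attends only over that budget. The following is plain softmax attention restricted to the selected rows, written as an illustrative sketch rather than the paper's fused implementation; sparse_decode_attention is a hypothetical name.

```python
import numpy as np

def sparse_decode_attention(q, K, V, selected_idx):
    """One decoding step of attention restricted to the selected KV budget.

    q: (d,) query for the current token; K, V: (n, d) cached keys/values;
    selected_idx: indices of the (e.g., 64) keys kept for this query.
    """
    K_sel = K[selected_idx]                     # (k, d)
    V_sel = V[selected_idx]                     # (k, d)
    scores = K_sel @ q / np.sqrt(q.shape[-1])   # (k,)
    scores -= scores.max()                      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ V_sel                      # (d,) attention output

# Toy usage with a 64-token budget out of a 4K-token cache.
rng = np.random.default_rng(2)
K = rng.standard_normal((4096, 128)).astype(np.float32)
V = rng.standard_normal((4096, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
idx = rng.choice(4096, size=64, replace=False)  # stand-in for the retrieved top-k
out = sparse_decode_attention(q, K, V, idx)
print(out.shape)  # (128,)
```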
Authors

Siyuan Yan, Research Fellow at Monash University (AI for Medicine; Foundation Model)
Guo-Qing Jiang, rednote hilab, China
Yuchen Zhang, State Key Laboratory for Novel Software Technology, Nanjing University, China
Xiaoxing Ma, Professor of Computer Science and Technology, Nanjing University (software engineering; self-adaptive systems; reliability of machine learning)
Ran Zhu, rednote hilab, China
Chun Cao, Nanjing University
Jingwei Xu, State Key Laboratory for Novel Software Technology, Nanjing University, China