Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the decoding latency caused by the quadratic computational complexity of self-attention in long-context scenarios, this paper proposes Adamas, a sparse attention mechanism that constructs compact key-value (KV) representations via the Hadamard transform, bucketization, and 2-bit quantization, and performs efficient top-k retrieval using a Manhattan-distance approximation. Unlike heuristic sparsification methods, Adamas substantially improves the recall of critical KV pairs under extremely low KV cache budgets while preserving accuracy. Experiments on 32K-length sequences demonstrate a 4.4× speedup in self-attention computation and a 1.5× end-to-end inference speedup. Moreover, Adamas supports up to 8× higher sparsity than state-of-the-art methods while matching or even reducing perplexity, striking a strong balance between efficiency and modeling fidelity.
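For intuition, here is a minimal sketch of the retrieval path described above: keys and the query are passed through a fast Walsh-Hadamard transform, each coordinate is bucketized into one of four levels (a 2-bit code), and cached keys are ranked by the Manhattan distance between codes. The function names, fixed bucket thresholds, and NumPy implementation are assumptions for illustration only; the paper's actual kernels, calibration, and quantization details may differ.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two (true for typical head dims).
    """
    orig_shape = x.shape
    d = orig_shape[-1]
    y = x.reshape(-1, d).astype(np.float32).copy()
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = y[:, start:start + h].copy()
            b = y[:, start + h:start + 2 * h].copy()
            y[:, start:start + h] = a + b
            y[:, start + h:start + 2 * h] = a - b
        h *= 2
    return (y / np.sqrt(d)).reshape(orig_shape)

def bucketize_2bit(x, thresholds=(-0.5, 0.0, 0.5)):
    """Map each transformed coordinate to one of four buckets (a 2-bit code).

    The fixed thresholds are placeholders; the paper would calibrate bucket
    boundaries rather than hard-code them.
    """
    return np.digitize(x, thresholds).astype(np.uint8)  # values in {0, 1, 2, 3}

def topk_by_manhattan(q, keys, k):
    """Select the k cached keys whose 2-bit codes are closest to the query's code
    in Manhattan distance, as a cheap proxy for full attention scores."""
    q_code = bucketize_2bit(fwht(q))      # (d,)
    k_codes = bucketize_2bit(fwht(keys))  # (n, d)
    dists = np.abs(k_codes.astype(np.int16) - q_code.astype(np.int16)).sum(axis=-1)
    return np.argsort(dists)[:k]          # indices of the retained keys

# Toy usage: retrieve a 64-token budget out of a 32K-token cache for one head.
rng = np.random.default_rng(0)
keys = rng.standard_normal((32_768, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
selected = topk_by_manhattan(query, keys, k=64)
print(selected.shape)  # (64,)
```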

📝 Abstract
Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
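One concrete aspect of the compact representation is the 2-bit compression mentioned in the abstract. The snippet below sketches how four 2-bit bucket codes could be packed into a single byte; pack_2bit and unpack_2bit are hypothetical helpers and not the paper's actual storage layout.

```python
import numpy as np

def pack_2bit(codes):
    """Pack four 2-bit bucket codes into one byte (4x smaller than one code per byte)."""
    flat = codes.reshape(-1, 4).astype(np.uint8)
    return (flat[:, 0]
            | (flat[:, 1] << 2)
            | (flat[:, 2] << 4)
            | (flat[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, last_dim):
    """Recover the 2-bit codes from the packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1, last_dim)

# A 128-dim code vector occupies 32 bytes after packing.
codes = np.random.default_rng(1).integers(0, 4, size=(16, 128), dtype=np.uint8)
packed = pack_2bit(codes)
assert np.array_equal(unpack_2bit(packed, 128), codes)
print(packed.nbytes, codes.nbytes)  # 512 vs 2048 bytes
```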
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic cost of self-attention in long contexts
Improves recall of critical key-value pairs in sparse attention
Maintains accuracy while enabling aggressive sparsity for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hadamard transform for compact attention representation
Manhattan-distance estimation for efficient top-k selection
Bucketization and 2-bit compression for sparse attention (a usage sketch follows below)
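Putting the listed pieces together: once a small set of key indices has been retrieved (for example by the Manhattan-distance scoring sketched earlier), decoding attends only over that budget. The following is plain softmax attention restricted to the selected rows, written as an illustrative sketch rather than the paper's fused implementation; sparse_decode_attention is a hypothetical name.

```python
import numpy as np

def sparse_decode_attention(q, K, V, selected_idx):
    """One decoding step of attention restricted to the selected KV budget.

    q: (d,) query for the current token; K, V: (n, d) cached keys/values;
    selected_idx: indices of the (e.g., 64) keys kept for this query.
    """
    K_sel = K[selected_idx]                     # (k, d)
    V_sel = V[selected_idx]                     # (k, d)
    scores = K_sel @ q / np.sqrt(q.shape[-1])   # (k,)
    scores -= scores.max()                      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ V_sel                      # (d,) attention output

# Toy usage with a 64-token budget out of a 4K-token cache.
rng = np.random.default_rng(2)
K = rng.standard_normal((4096, 128)).astype(np.float32)
V = rng.standard_normal((4096, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
idx = rng.choice(4096, size=64, replace=False)  # stand-in for the retrieved top-k
out = sparse_decode_attention(q, K, V, idx)
print(out.shape)  # (128,)
```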
Authors

Siyuan Yan, Research Fellow at Monash University (AI for Medicine; Foundation Model)
Guo-Qing Jiang, rednote hilab, China
Yuchen Zhang, State Key Laboratory for Novel Software Technology, Nanjing University, China
Xiaoxing Ma, Professor of Computer Science and Technology, Nanjing University (software engineering; self-adaptive systems; reliability of machine learning)
Ran Zhu, rednote hilab, China
Chun Cao, Nanjing University
Jingwei Xu, State Key Laboratory for Novel Software Technology, Nanjing University, China