SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the memory bottleneck imposed by key-value (KV) caches in autoregressive inference of large language models (LLMs), this paper proposes a fine-tuning-free, decompression-free sparse attention framework. The method applies an offline-precomputed orthogonal transformation to rotate KV caches, followed by structured pruning; attention is then computed directly in the resulting sparse space. Crucially, it avoids KV cache reconstruction, supports runtime-adaptive compression ratios, and incorporates a small dense buffer to preserve critical information. By bypassing quantization and cache eviction—both of which incur information loss and decompression overhead—the approach ensures fidelity and efficiency. Experiments demonstrate near-lossless model performance under 50–60% KV cache memory reduction, significantly improving inference throughput for long-context scenarios.
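The summary describes rotating the KV cache with an offline orthogonal matrix, pruning it in the rotated space, and computing attention directly on the pruned representation. A minimal sketch of that idea is below, exploiting the fact that dot products are invariant under a shared orthogonal rotation (q·k = (Rq)·(Rk)); the random rotation, dimensions, and top-magnitude pruning rule here are illustrative assumptions, not SWAN's actual learned transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64       # head dimension
keep = 32    # rotated dimensions kept per key (~50% compression)

# Offline step: an orthogonal rotation R. SWAN precomputes this from the
# model; a random orthogonal matrix stands in for it here (assumption).
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)        # current query
k = rng.standard_normal((16, d))  # 16 cached keys

# Rotate query and keys with the same R, then structurally prune each
# rotated key to its `keep` largest-magnitude coordinates.
q_rot = q @ R
k_rot = k @ R
idx = np.argsort(-np.abs(k_rot), axis=1)[:, :keep]

# Attention scores computed directly in the sparse space -- no
# reconstruction of the full keys is needed:
sparse_scores = (np.take_along_axis(k_rot, idx, axis=1) * q_rot[idx]).sum(axis=1)

# Dense reference for comparison: rotation alone is exact, so the only
# error comes from the pruned coordinates.
dense_scores = k @ q
err = np.max(np.abs(sparse_scores - dense_scores))
```

Because R is orthogonal, setting `keep = d` recovers the dense scores exactly; shrinking `keep` trades accuracy for memory, which is the knob the paper's runtime-adaptive compression turns.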

📝 Abstract
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques such as token eviction, quantization, and low-rank projection often risk information loss, impose fixed compression limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline-computed orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive per-token KV-cache memory savings of 50-60%. A key advantage is its runtime-tunable compression level, which allows operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
Problem

Research questions and friction points this paper is trying to address.

Reducing KV-cache memory footprint in LLMs during autoregressive inference
Eliminating decompression overhead and information loss in cache compression
Enabling runtime-tunable compression levels for dynamic memory adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rotates and prunes KV-cache using orthogonal matrix
Eliminates decompression overhead in attention computation
Enables runtime-tunable compression levels for memory adjustment
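The last innovation, runtime-tunable compression, follows naturally if the rotated coordinates are ordered by importance: the serving system can simply read a shorter prefix of each cached vector when memory is tight. A hedged sketch of that mechanism, with an importance ordering and PCA-style rotation assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 128

# Rotated KV cache whose coordinates are (by assumption) ordered from most
# to least important, e.g. via an offline PCA-style orthogonal transform.
# The decaying scale below just mimics that ordering for the demo.
k_rot = rng.standard_normal((n, d)) * np.linspace(1.5, 0.1, d)

def attention_scores(q_rot, cache, keep):
    """Score a rotated query against only the first `keep` cached dims.

    `keep` can change per request at runtime: no re-encoding of the
    cache is required, only reading a shorter prefix of each vector.
    """
    return cache[:, :keep] @ q_rot[:keep]

q_rot = rng.standard_normal(d)
full = attention_scores(q_rot, k_rot, d)       # uncompressed reference
half = attention_scores(q_rot, k_rot, d // 2)  # 50% memory, chosen at runtime
```

This contrasts with eviction or fixed-rank schemes, where the compression ratio is baked in when the cache entry is written rather than chosen when it is read.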
Santhosh G S
Centre for Responsible AI, Indian Institute of Technology Madras
Saurav Prakash
Department of Electrical Engineering, Indian Institute of Technology Madras
Balaraman Ravindran
Professor of Data Science and AI, Wadhwani School of Data Science and AI, IIT Madras
Reinforcement Learning · Data Mining · Network Analysis · Responsible AI