FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension

📅 2025-05-01

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

To address the bottlenecks of linear memory growth in KV caches and quadratic computational complexity in self-attention during long-context extension of large language models (LLMs), this paper identifies, for the first time, significant energy sparsity of KV caches in the frequency domain. We propose a parameter-free, architecture-agnostic iterative frequency-domain compression mechanism: leveraging FFT/iFFT transforms, low-pass filtering, and dynamic truncation-reconstruction to achieve lossless approximate compression of KV caches while preserving critical low-frequency components. The method supports both efficient fine-tuning and inference without modifying model architecture or introducing trainable parameters. Experiments on long-context tasks exceeding 32K tokens demonstrate over 40% GPU memory reduction, a 2.1× inference speedup, and performance approaching that of full KV-cache baselines—substantially improving modeling efficiency and deployability for long-context LLMs.

Technology Category

Application Category

📝 Abstract

Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.

Problem

Research questions and friction points this paper is trying to address.

Compress KV cache in frequency domain for longer contexts

Reduce memory and complexity in LLM self-attention

Maintain performance with minimal fine-tuning and no architecture changes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses KV cache in frequency domain

Filters high-frequency components for efficiency

Requires no additional parameters or modifications

🔎 Similar Papers

AutoFlow: An Autoencoder-based Approach for IP Flow Record Compression with Minimal Impact on Traffic Classification