🤖 AI Summary
To address GPU memory bottlenecks caused by KV caching in long-context reasoning, this paper proposes a two-stage low-rank attention framework: during prefilling, the full-precision query and key matrices are jointly decomposed via low-rank approximation; during decoding, proxy attention scores are computed in the low-dimensional subspace, with exact KV restoration ensuring output fidelity. A hybrid GPU-CPU cache architecture is further introduced, coupled with a hit-driven data-loading strategy that dynamically selects tokens via top-$k$ and recency-based criteria, substantially reducing the memory footprint and cross-device transfer overhead. Experiments on LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses state-of-the-art sparse-attention approaches on the RULER and LongBench benchmarks, while achieving significant GPU memory reduction with negligible accuracy degradation.
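The two stages above can be sketched in NumPy. This is a hypothetical illustration, not the paper's implementation: it stands in a truncated SVD for the joint low-rank decomposition, projects queries and keys into a shared rank-$r$ subspace, scores one decode-step query there, and then runs exact attention only over the top-$k$ plus a small window of recent tokens (all sizes below are made-up toy values).

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, r, k_top, n_recent = 1024, 128, 16, 8, 4   # toy sizes

Q = rng.standard_normal((l, d))
K = rng.standard_normal((l, d))

# Prefill: jointly factorise Q and K into a shared rank-r subspace.
# A truncated SVD of the stacked matrix is used here as a stand-in
# for the paper's decomposition.
U, S, Vt = np.linalg.svd(np.vstack([Q, K]), full_matrices=False)
P = Vt[:r].T                       # (d, r) shared projection
Q_low, K_low = Q @ P, K @ P        # compact rank-r factors

# Decode: proxy attention logits for the newest query,
# costing O(l*r) instead of O(l*d).
q_low = Q_low[-1]
proxy = K_low @ q_low              # (l,) proxy scores

# Keep the top-k proxy tokens plus a fixed window of recent tokens.
topk_idx = np.argpartition(proxy, -k_top)[-k_top:]
recent_idx = np.arange(l - n_recent, l)
selected = np.union1d(topk_idx, recent_idx)

# Exact attention restricted to the selected full-precision keys.
logits = K[selected] @ Q[-1] / np.sqrt(d)
weights = np.exp(logits - logits.max())
weights /= weights.sum()
```

Only the selected rows of the full-precision KV cache ever need to reside on the GPU; the proxy scores decide which rows those are at each step.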
📝 Abstract
As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low-Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-$r$ factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in $\mathcal{O}(lr)$ time at each decode step. By selecting only the top-$k$ tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long-context settings, while delivering significant memory savings with minimal loss in accuracy. Our code is available at https://github.com/tenghuilee/LRQK.
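The hit-and-miss mechanism can be illustrated with a toy model (a hypothetical sketch, not the released implementation): the full-precision KV pairs live in a CPU-side store, a small dictionary plays the role of the GPU-resident cache, and only entries missing from the cache are copied across.

```python
import numpy as np

class HybridKVCache:
    """Toy model of a GPU-resident KV cache backed by a full CPU store.

    A hypothetical sketch of a hit-and-miss loading scheme: a fetch
    copies a token's full-precision KV pair from the CPU store only
    if it is not already cache-resident.
    """

    def __init__(self, cpu_kv):
        self.cpu_kv = cpu_kv      # full-precision KV pair per token ("CPU")
        self.gpu_cache = {}       # token index -> resident KV pair ("GPU")
        self.transfers = 0        # number of simulated CPU->GPU copies

    def fetch(self, token_ids):
        """Return KV pairs for token_ids, transferring only misses."""
        out = {}
        for t in token_ids:
            if t not in self.gpu_cache:        # miss: copy this entry only
                self.gpu_cache[t] = self.cpu_kv[t]
                self.transfers += 1
            out[t] = self.gpu_cache[t]         # hit: reuse resident copy
        return out

# One KV vector per token; values are placeholders.
cpu_kv = {i: np.full(4, i, dtype=np.float32) for i in range(16)}
cache = HybridKVCache(cpu_kv)
cache.fetch([0, 1, 2, 3])   # four misses: four transfers
cache.fetch([2, 3, 4])      # two hits, one miss: one more transfer
```

Because top-$k$ selections overlap heavily between consecutive decode steps, most fetches are hits, so cross-device traffic stays far below re-loading the selected set every step.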