Inference-Time Hyper-Scaling with KV Cache Compression

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Transformer-based large language models, the KV cache size dominates computational and memory overheads, limiting generation length and accuracy. To address this, we propose Dynamic Memory Sparsification (DMS), a differentiable, inference-efficient KV cache compression method. DMS integrates delayed token pruning with implicit representation fusion, achieving 8× KV compression after only 1K fine-tuning steps—substantially outperforming training-free sparse attention baselines. Crucially, DMS enables inference-time scaling without increasing latency or memory footprint, thereby enhancing long-context generation capability. Evaluated on Qwen-R1-32B, DMS improves performance by +9.1, +7.6, and +9.6 points on AIME 2024, GPQA, and LiveCodeBench, respectively. These results demonstrate a significant advancement in the accuracy–efficiency trade-off for KV cache management.


📝 Abstract
Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
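The core mechanism the abstract describes, delaying eviction so that a to-be-dropped token's information can be merged into surviving entries rather than discarded outright, can be illustrated with a toy sketch. This is not the paper's trained DMS (which learns eviction decisions differentiably during fine-tuning); the `DelayedEvictionKVCache` class, the fixed `delay` window, the alternating eviction rule, and the averaging merge are all illustrative stand-ins:

```python
from collections import deque

class DelayedEvictionKVCache:
    """Toy sketch of delayed token eviction with implicit merging.

    Tokens flagged for eviction are not dropped immediately: they wait in a
    retention window for `delay` further steps, and on expiry their value is
    averaged into the nearest surviving entry instead of being lost. The
    eviction decision itself is a placeholder here; DMS learns it.
    """

    def __init__(self, delay=2):
        self.delay = delay
        self.kept = []          # surviving [key, value] pairs (scalars for clarity)
        self.pending = deque()  # (steps_left, key, value) awaiting final eviction

    def append(self, key, value, evict):
        # Advance retention timers; collect tokens whose delay has expired.
        expired = []
        for _ in range(len(self.pending)):
            steps_left, k, v = self.pending.popleft()
            if steps_left <= 1:
                expired.append((k, v))
            else:
                self.pending.append((steps_left - 1, k, v))
        # Implicit merge: fold each expired value into the last kept entry.
        for k, v in expired:
            if self.kept:
                self.kept[-1][1] = 0.5 * (self.kept[-1][1] + v)
        # Route the new token: retention window if flagged, else keep it.
        if evict:
            self.pending.append((self.delay, key, value))
        else:
            self.kept.append([key, value])

    def size(self):
        return len(self.kept) + len(self.pending)
```

With a placeholder rule that flags every other token, eight appended tokens leave only five cache entries, while the evicted tokens' values survive (averaged) inside their neighbours rather than vanishing, which is the intuition behind preserving accuracy at high compression ratios.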
Problem

Research questions and friction points this paper is trying to address.

KV cache compression for efficient inference scaling
Maintaining accuracy at high compression ratios
Dynamic Memory Sparsification for delayed token eviction
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression for efficient scaling
Dynamic Memory Sparsification (DMS) method
Delayed token eviction preserves critical information