LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

📅 2025-07-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address memory and efficiency bottlenecks in large language models (LLMs) caused by quadratic growth of KV cache size with sequence length, this paper proposes a training-free KV cache optimization method. The approach features: (1) a cross-layer hierarchical KV cache structure that preserves information gradients across layers; and (2) a dynamic, distance-aware iterative compression mechanism that adaptively prunes redundant entries under a fixed cache budget. By jointly preserving long-range dependencies and ensuring stability during autoregressive generation, the method significantly mitigates out-of-memory risks. Extensive evaluation across benchmarks (e.g., LongBench, GovReport), tasks (question answering, summarization), and mainstream models (Llama-2/3, Qwen) demonstrates consistent effectiveness—yielding an average 12.7% improvement in long-context modeling performance and a 38% reduction in inference GPU memory usage, all without additional training overhead.

Technology Category

Application Category

📝 Abstract
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
Problem

Research questions and friction points this paper is trying to address.

Optimize KV cache for efficient long-context LLM modeling
Enhance long-range capabilities without memory overflow
Balance cache storage and continuous generation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ladder-shaped KV cache pattern
Iterative compaction mechanism
Training-free generative inference
🔎 Similar Papers
No similar papers found.