PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory bottleneck caused by KV cache growth during large language model (LLM) inference, this paper proposes a fine-grained, block-level cache pruning method tailored for PagedAttention. The core innovation is a structured block eviction algorithm that requires no CUDA kernel modifications, coupled with an attention-state-driven block importance scoring mechanism, ensuring seamless compatibility with PagedAttention's memory management. Crucially, the method preserves inference correctness while significantly improving memory utilization. Experimental evaluation on Llama-family models and the LongBench long-context benchmark suite demonstrates that, compared to baseline approaches, the method reduces KV cache memory footprint while improving generation accuracy.

📝 Abstract
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache memory bottleneck in LLMs
Improving memory efficiency for long sequences
Enhancing vLLM's PagedAttention without kernel modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-wise eviction algorithm for paged memory layouts
Structured KV cache pruning strategy for vLLM
Seamless integration with PagedAttention without kernel modifications
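The block-wise eviction idea above can be sketched as follows. This is an illustrative sketch only, not the paper's actual implementation: the helper names (`block_scores`, `blocks_to_evict`) and the specific scoring rule (accumulated attention mass per block) are assumptions chosen to match the description of "attention-state-driven block importance scoring" with whole-block eviction aligned to a paged layout.

```python
import numpy as np

BLOCK_SIZE = 16  # tokens per KV cache page/block (hypothetical value)

def block_scores(attn_weights: np.ndarray) -> np.ndarray:
    """Score each full block of BLOCK_SIZE key positions.

    attn_weights: (num_queries, seq_len) attention probabilities.
    The score of a block is the total attention mass its tokens received;
    a trailing partial block is left unscored (and thus never evicted).
    """
    seq_len = attn_weights.shape[1]
    num_blocks = seq_len // BLOCK_SIZE
    per_token = attn_weights.sum(axis=0)  # mass each key position received
    return (
        per_token[: num_blocks * BLOCK_SIZE]
        .reshape(num_blocks, BLOCK_SIZE)
        .sum(axis=1)
    )

def blocks_to_evict(attn_weights: np.ndarray, budget_blocks: int) -> list[int]:
    """Indices of the least-important full blocks to free.

    Evicts whole blocks (never individual tokens), so the paged memory
    layout stays intact and no attention-kernel changes are needed.
    Keeps at most `budget_blocks` full blocks resident.
    """
    scores = block_scores(attn_weights)
    if len(scores) <= budget_blocks:
        return []
    order = np.argsort(scores)  # ascending: least-attended blocks first
    return sorted(order[: len(scores) - budget_blocks].tolist())
```

Because eviction operates on whole blocks, the surviving cache remains a valid set of pages, which is what allows the approach to plug into a paged allocator without touching the attention kernels themselves.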