🤖 AI Summary
To address the memory bottleneck caused by KV cache growth during large language model (LLM) inference, this paper proposes PagedEviction, a fine-grained, block-level cache pruning method tailored for PagedAttention. Its core contributions are a structured block-eviction algorithm that requires no CUDA kernel modifications and an attention-state-driven block importance score, both designed for seamless compatibility with PagedAttention's memory management. Experiments on Llama-family models and the LongBench long-context benchmark suite show that, compared to baseline eviction approaches, the method reduces KV cache memory footprint while achieving better generation accuracy.
📝 Abstract
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
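The paper's details are not spelled out in the abstract, but the core idea of block-wise eviction can be illustrated with a minimal sketch. The code below is a hypothetical NumPy implementation, not the authors' method: it assumes each KV block holds a fixed number of tokens (16, matching vLLM's default page size), scores each block by the total attention mass its tokens received from recent queries, and frees whole low-scoring blocks when the cache exceeds a budget. The function names `block_scores` and `blocks_to_evict` are illustrative inventions.

```python
import numpy as np

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default page size; an assumption here)

def block_scores(attn_weights: np.ndarray, block_size: int = BLOCK_SIZE) -> np.ndarray:
    """Score each KV block by the total attention mass its tokens received.

    attn_weights: shape (num_queries, seq_len), attention probabilities
    from recent query tokens over all cached key positions.
    """
    seq_len = attn_weights.shape[1]
    num_blocks = (seq_len + block_size - 1) // block_size
    per_token = attn_weights.sum(axis=0)  # total mass per cached token
    # Pad the last partial block with zeros so we can reshape into blocks.
    per_token = np.pad(per_token, (0, num_blocks * block_size - seq_len))
    return per_token.reshape(num_blocks, block_size).sum(axis=1)

def blocks_to_evict(attn_weights: np.ndarray, max_blocks: int,
                    block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indices of whole blocks to free so at most max_blocks remain.

    The most recent block is always kept: it holds the tokens the model
    is actively extending. Evicting whole blocks (rather than scattered
    tokens) keeps the paged memory layout intact, so no kernel changes
    are needed -- freed pages simply return to the allocator.
    """
    scores = block_scores(attn_weights, block_size)
    num_blocks = len(scores)
    if num_blocks <= max_blocks:
        return []
    candidates = np.argsort(scores[:-1])  # never evict the last block
    return sorted(candidates[: num_blocks - max_blocks].tolist())
```

For example, with 48 cached tokens (3 blocks) where recent queries place all their attention on the last 32 tokens, a budget of 2 blocks evicts only block 0. The key design point this sketch mirrors is structural: because eviction operates on page-aligned blocks, it composes with PagedAttention's existing block table rather than fighting it.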