🤖 AI Summary
To address the memory bottleneck caused by KV cache growth during large language model (LLM) inference, this paper proposes PagedEviction, a fine-grained, block-level cache pruning method tailored for PagedAttention. Its core contributions are a structured block-eviction algorithm that requires no CUDA kernel modifications and an attention-state-driven block importance score, both designed for seamless compatibility with PagedAttention's memory management. Experiments on Llama-family models and the LongBench long-context benchmark suite show that, compared to baseline eviction approaches, the method reduces KV cache memory footprint while achieving better generation accuracy.
📝 Abstract
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
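The paper's details are not spelled out in the abstract, but the core idea of block-wise eviction can be illustrated with a minimal sketch. The code below is a hypothetical NumPy implementation, not the authors' method: it assumes each KV block holds a fixed number of tokens (16, matching vLLM's default page size), scores each block by the total attention mass its tokens received from recent queries, and frees whole low-scoring blocks when the cache exceeds a budget. The function names `block_scores` and `blocks_to_evict` are illustrative inventions.

```python
import numpy as np

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default page size; an assumption here)

def block_scores(attn_weights: np.ndarray, block_size: int = BLOCK_SIZE) -> np.ndarray:
    """Score each KV block by the total attention mass its tokens received.

    attn_weights: shape (num_queries, seq_len), attention probabilities
    from recent query tokens over all cached key positions.
    """
    seq_len = attn_weights.shape[1]
    num_blocks = (seq_len + block_size - 1) // block_size
    per_token = attn_weights.sum(axis=0)  # total mass per cached token
    # Pad the last partial block with zeros so we can reshape into blocks.
    per_token = np.pad(per_token, (0, num_blocks * block_size - seq_len))
    return per_token.reshape(num_blocks, block_size).sum(axis=1)

def blocks_to_evict(attn_weights: np.ndarray, max_blocks: int,
                    block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indices of whole blocks to free so at most max_blocks remain.

    The most recent block is always kept: it holds the tokens the model
    is actively extending. Evicting whole blocks (rather than scattered
    tokens) keeps the paged memory layout intact, so no kernel changes
    are needed -- freed pages simply return to the allocator.
    """
    scores = block_scores(attn_weights, block_size)
    num_blocks = len(scores)
    if num_blocks <= max_blocks:
        return []
    candidates = np.argsort(scores[:-1])  # never evict the last block
    return sorted(candidates[: num_blocks - max_blocks].tolist())
```

For example, with 48 cached tokens (3 blocks) where recent queries place all their attention on the last 32 tokens, a budget of 2 blocks evicts only block 0. The key design point this sketch mirrors is structural: because eviction operates on page-aligned blocks, it composes with PagedAttention's existing block table rather than fighting it.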