🤖 AI Summary
This work proposes Zipage, a high-throughput inference engine for large language models (LLMs) designed to alleviate the memory bottleneck caused by KV caching. The key innovation is the first integration of fine-grained, token-level KV cache eviction with PagedAttention, yielding a compressed variant of PagedAttention. Zipage further introduces an efficient scheduling strategy that supports prefix caching and asynchronous compression. On large-scale mathematical reasoning tasks, this approach achieves over 2.1× speedup while retaining about 95% of the accuracy of full KV caching, substantially improving throughput and deployment efficiency.
📝 Abstract
As reasoning becomes the dominant generative paradigm for large language models (LLMs), the memory bottleneck caused by the KV cache during the decoding phase has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address the memory issue, most are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy for Compressed PagedAttention that supports prefix caching and asynchronous compression. Building on this, we develop Zipage, a high-concurrency LLM inference engine. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of full-KV inference engines while delivering over 2.1× speedup.
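The abstract does not spell out how token-wise eviction interacts with paged KV storage. As a minimal illustration only (the function name, block size, and scoring scheme below are assumptions, not Zipage's actual implementation), the sketch keeps the highest-scoring tokens per sequence and repacks the survivors into fixed-size pages, which is the kind of compaction a compressed PagedAttention layout requires:

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per KV-cache page (illustrative; real engines often use 16)

def evict_and_repack(keys, values, scores, keep):
    """Token-level KV eviction sketch: retain the `keep` highest-scoring
    tokens, then compact survivors into fixed-size pages (blocks).

    keys, values: [num_tokens, head_dim] arrays for one sequence/head.
    scores: importance score per token (e.g. accumulated attention weight).
    Returns paged keys/values of shape [num_pages, BLOCK_SIZE, head_dim]
    and the kept token indices (in original order).
    """
    # Indices of the top-`keep` tokens, sorted to preserve token order.
    kept = np.sort(np.argsort(scores)[-keep:])
    k, v = keys[kept], values[kept]
    # Repack survivors into pages, zero-padding the final partial page.
    n_pages = -(-keep // BLOCK_SIZE)  # ceiling division
    pad = n_pages * BLOCK_SIZE - keep
    k = np.concatenate([k, np.zeros((pad,) + k.shape[1:])])
    v = np.concatenate([v, np.zeros((pad,) + v.shape[1:])])
    return (k.reshape(n_pages, BLOCK_SIZE, -1),
            v.reshape(n_pages, BLOCK_SIZE, -1),
            kept)
```

After eviction, attention reads only the compacted pages, so both memory footprint and per-step attention cost shrink with the number of evicted tokens; a production engine would additionally update the page table mapping logical to physical blocks.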