Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Zipage, a high-throughput inference engine for large language models (LLMs) designed to alleviate the memory bottleneck caused by KV caching. The key innovation lies in the first integration of fine-grained, token-level KV cache eviction with PagedAttention, resulting in a compressed variant of PagedAttention. Zipage further introduces an efficient scheduling strategy that supports prefix caching and asynchronous compression. This approach achieves over 2.1× speedup on large-scale mathematical reasoning tasks while retaining 95% of the performance of full KV caching, substantially improving throughput and deployment efficiency.

📝 Abstract
With reasoning becoming the dominant generative paradigm for large language models (LLMs), the memory bottleneck caused by the KV cache during the decoding phase has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency LLM inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering over 2.1× speedup.
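To make the core idea concrete, here is a minimal sketch of token-level KV cache eviction inside a paged KV cache, in the spirit of Compressed PagedAttention. The class, block size, and scoring interface are illustrative assumptions, not the paper's implementation; the point is that evicting individual tokens and repacking survivors into dense blocks frees whole pages for other requests.

```python
# Hypothetical sketch: token-level eviction in a paged KV cache.
# BLOCK_SIZE, class names, and the scoring interface are assumptions
# for illustration, not the paper's actual implementation.

BLOCK_SIZE = 4  # tokens per KV block (a vLLM-style page)

class PagedKVCache:
    def __init__(self):
        # Each block is a list of (token_position, key, value) entries.
        self.blocks = []

    def append(self, pos, key, value):
        # Start a new block when the last one is full.
        if not self.blocks or len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])
        self.blocks[-1].append((pos, key, value))

    def tokens(self):
        return [t for block in self.blocks for t in block]

    def num_blocks(self):
        return len(self.blocks)

    def evict(self, scores, keep):
        # Token-level eviction: keep the `keep` highest-scoring tokens
        # (scores maps token position -> importance), then repack the
        # survivors into dense blocks so whole pages are released.
        survivors = sorted(self.tokens(),
                           key=lambda t: scores[t[0]],
                           reverse=True)[:keep]
        survivors.sort(key=lambda t: t[0])  # restore positional order
        self.blocks = []
        for t in survivors:
            self.append(*t)
```

For example, a sequence of 8 cached tokens occupies 2 blocks; after evicting down to 4 tokens, the survivors compact into a single block, and the freed page can be handed to another concurrent request.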
Problem

Research questions and friction points this paper is trying to address.

KV cache
memory bottleneck
high-concurrency
LLM reasoning
decoding phase
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed PagedAttention
KV cache eviction
high-concurrency LLM inference
prefix caching
asynchronous compression