🤖 AI Summary
This work proposes Zipage, a high-throughput inference engine for large language models (LLMs) designed to alleviate the memory bottleneck caused by KV caching. The key innovation is the first integration of fine-grained, token-level KV cache eviction with PagedAttention, yielding a compressed variant of PagedAttention. Zipage further introduces an efficient scheduling strategy that supports prefix caching and asynchronous compression. On large-scale mathematical reasoning tasks, this approach achieves over 2.1× speedup while retaining about 95% of the accuracy of full KV caching, substantially improving throughput and deployment efficiency.
📝 Abstract
As reasoning becomes the dominant generative paradigm for large language models (LLMs), the memory bottleneck caused by the KV cache during the decoding phase has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address the memory issue, most are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy for Compressed PagedAttention that supports prefix caching and asynchronous compression. Building on this, we develop Zipage, a high-concurrency LLM inference engine. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of full-KV inference engines while delivering over 2.1× speedup.
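The abstract does not spell out how token-wise eviction interacts with paged KV storage. As a minimal illustration only (the function name, block size, and scoring scheme below are assumptions, not Zipage's actual implementation), the sketch keeps the highest-scoring tokens per sequence and repacks the survivors into fixed-size pages, which is the kind of compaction a compressed PagedAttention layout requires:

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per KV-cache page (illustrative; real engines often use 16)

def evict_and_repack(keys, values, scores, keep):
    """Token-level KV eviction sketch: retain the `keep` highest-scoring
    tokens, then compact survivors into fixed-size pages (blocks).

    keys, values: [num_tokens, head_dim] arrays for one sequence/head.
    scores: importance score per token (e.g. accumulated attention weight).
    Returns paged keys/values of shape [num_pages, BLOCK_SIZE, head_dim]
    and the kept token indices (in original order).
    """
    # Indices of the top-`keep` tokens, sorted to preserve token order.
    kept = np.sort(np.argsort(scores)[-keep:])
    k, v = keys[kept], values[kept]
    # Repack survivors into pages, zero-padding the final partial page.
    n_pages = -(-keep // BLOCK_SIZE)  # ceiling division
    pad = n_pages * BLOCK_SIZE - keep
    k = np.concatenate([k, np.zeros((pad,) + k.shape[1:])])
    v = np.concatenate([v, np.zeros((pad,) + v.shape[1:])])
    return (k.reshape(n_pages, BLOCK_SIZE, -1),
            v.reshape(n_pages, BLOCK_SIZE, -1),
            kept)
```

After eviction, attention reads only the compacted pages, so both memory footprint and per-step attention cost shrink with the number of evicted tokens; a production engine would additionally update the page table mapping logical to physical blocks.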