🤖 AI Summary
Existing KV cache compression methods rely on global top-k selection, which often causes region wipe-out (the severe eviction of contiguous reasoning blocks) and thereby disrupts logical coherence. This work proposes Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts compression from token-level competition to region-aware quota allocation. AMS adaptively partitions the cache based on the spatial distribution of attention mass and uses exponential moving average (EMA) smoothing to keep segment boundaries stable during decoding. AMS guarantees memory quotas for structurally vital regions to prevent structural fragmentation, and it is a plug-and-play layer compatible with mainstream scorers (e.g., TOVA, Expected Attention) and paged-KV serving systems (e.g., vLLM) without introducing additional steady-state attention overhead. Experiments show that AMS consistently mitigates fragmentation and improves performance across diverse tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain question answering, and sparse retrieval.
📝 Abstract
The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks, which derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring that structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, AMS incorporates an EMA-based smoothing mechanism that prevents jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers and can be integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV, and TriAttention. AMS is also compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
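The contrast between global Top-k selection and region-aware quota allocation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how segments or quotas are computed, so the equal-mass boundary rule, the equal per-segment quota, the EMA coefficient `alpha`, and all function names (`ema_update`, `mass_segments`, `global_top_k`, `region_quota_select`) are assumptions made for illustration. The same toy vector also serves as both the attention mass and the scorer's importance score.

```python
from bisect import bisect_left
from itertools import accumulate


def ema_update(prev, current, alpha=0.3):
    """EMA of per-token attention mass across decode steps (alpha is assumed)."""
    if prev is None:
        return list(current)
    # Tokens appended since the previous step have no history; use their current value.
    padded = list(prev) + list(current[len(prev):])
    return [(1 - alpha) * p + alpha * c for p, c in zip(padded, current)]


def mass_segments(mass, num_segments):
    """Assumed rule: contiguous segments holding roughly equal attention mass."""
    total = sum(mass)
    cum = list(accumulate(mass))
    bounds = [0]
    for s in range(1, num_segments):
        # First position whose cumulative mass reaches the s-th quantile.
        cut = bisect_left(cum, total * s / num_segments) + 1
        cut = min(max(cut, bounds[-1] + 1), len(mass))
        bounds.append(cut)
    bounds.append(len(mass))
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def global_top_k(scores, budget):
    """Baseline: keep the `budget` highest-scoring tokens regardless of position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def region_quota_select(scores, segments, budget):
    """Keep the top-scoring tokens inside each segment under an equal quota."""
    base, extra = divmod(budget, len(segments))
    keep = []
    for idx, (a, b) in enumerate(segments):
        # A segment shorter than its quota keeps all of its tokens.
        quota = min(base + (1 if idx < extra else 0), b - a)
        ranked = sorted(range(a, b), key=lambda i: scores[i], reverse=True)
        keep.extend(ranked[:quota])
    return sorted(keep)


# Toy scores: tokens 0-7 form an early block with low scores, tokens 8-11 dominate.
scores = [5, 6, 5, 4, 2, 2, 3, 3, 20, 18, 17, 15]
mass = ema_update(None, scores)          # first decode step: no history yet
segments = mass_segments(mass, num_segments=4)

kept_global = global_top_k(scores, budget=4)
kept_region = region_quota_select(scores, segments, budget=4)

print("segments:", segments)
print("global top-k keeps:", kept_global)    # [8, 9, 10, 11]
print("region quotas keep:", kept_region)    # [1, 8, 9, 11]
```

In this toy case the global selection evicts every token in positions 0-7, whereas the per-segment quota retains token 1 from the early block. A real integration would take its scores from an existing scorer such as TOVA and would then gather and compact the kept KV entries; both steps are outside the scope of this sketch.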