AI Summary
To address the scalability bottleneck in long-context inference for large language models (LLMs) caused by explosive KV cache memory growth, this paper proposes SABlock, a semantic-aware cache eviction framework. Methodologically, SABlock introduces three key innovations: (i) structured segmentation based on semantic boundaries; (ii) a segment-guided token importance scoring mechanism; and (iii) a budget-driven, adaptive block-size search strategy that dynamically optimizes compression granularity under semantic integrity constraints. Experimental results demonstrate that SABlock achieves 99.9% accuracy on the NIAH benchmark using only 96 KV entries, reduces peak memory consumption by 46.28% at 128K context length, and accelerates decoding by up to 9.5x, substantially outperforming existing token-, block-, and sentence-level compression methods.
Abstract
The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding at a 128K context length.
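The three stages described above can be illustrated with a toy sketch. The paper's actual scoring and search procedures are not reproduced here; this is a minimal illustration under assumptions: per-token importance is given as a precomputed array (e.g. accumulated attention), segment boundaries are supplied externally, the budget is split evenly across segments, and the block-size "search" is a simple greedy pick from a fixed candidate set. The function name `sablock_evict` and all parameters are hypothetical.

```python
import numpy as np

def sablock_evict(attn_scores, boundaries, budget, block_sizes=(1, 2, 4, 8)):
    """Toy sketch of a SABlock-style eviction pipeline (not the paper's algorithm).

    attn_scores : 1-D array of per-token importance scores
    boundaries  : indices where semantic segments split (as for np.split)
    budget      : total number of KV entries to retain
    Returns the sorted indices of retained tokens.
    """
    # 1) Structured segmentation: split the token sequence at semantic boundaries.
    segments = np.split(np.arange(len(attn_scores)), boundaries)

    # 2) Segment-guided scoring: blend each token's score with its segment's
    #    mean, so tokens in coherent segments are kept or dropped together.
    scores = attn_scores.astype(float).copy()
    for seg in segments:
        if len(seg):
            scores[seg] = 0.5 * scores[seg] + 0.5 * scores[seg].mean()

    # 3) Budget-driven block selection: per segment, choose the largest
    #    candidate block size that fits the segment's share of the budget,
    #    then keep whole blocks greedily by mean block score.
    kept = []
    per_seg = budget // max(len(segments), 1)  # even split, for simplicity
    for seg in segments:
        remaining = per_seg
        bs = max((b for b in block_sizes if b <= max(remaining, 1)), default=1)
        blocks = [seg[i:i + bs] for i in range(0, len(seg), bs)]
        blocks.sort(key=lambda b: scores[b].mean(), reverse=True)
        for b in blocks:
            if remaining >= len(b):
                kept.extend(b.tolist())
                remaining -= len(b)
    return sorted(kept)
```

For example, with 12 tokens, segment boundaries at positions 4 and 8, and a budget of 6, the sketch keeps the highest-scoring block of two tokens from each of the three segments. Keeping contiguous blocks rather than isolated tokens is what preserves local semantic context under the budget.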