SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

📅 2025-10-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scalability bottleneck in long-context inference for large language models (LLMs) caused by explosive KV cache memory growth, this paper proposes SABlock, a semantic-aware cache eviction framework. Methodologically, SABlock introduces three key innovations: (i) structured segmentation based on semantic boundaries; (ii) a segment-guided token importance scoring mechanism; and (iii) a budget-driven, adaptive block-size search strategy that dynamically optimizes compression granularity under semantic integrity constraints. Experimental results demonstrate that SABlock achieves 99.9% accuracy on the NIAH benchmark using only 96 KV entries, reduces peak memory consumption by 46.28% at 128K context length, and accelerates decoding by up to 9.5×, substantially outperforming existing token-, block-, or sentence-level compression methods.
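The three stages summarized above can be sketched in a few lines. This is a minimal, hypothetical illustration: the function names, the boundary heuristic, and the scoring rule (per-token attention mass scaled by the segment's mean importance) are assumptions for exposition, not the paper's released implementation.

```python
# Illustrative sketch of semantic segmentation + segment-guided scoring +
# budget-driven eviction. All names and heuristics are assumptions, not
# SABlock's actual code.

def semantic_segments(tokens, boundaries=(".", "?", "!")):
    """Split a token stream into segments at semantic boundary tokens."""
    segments, cur = [], []
    for i, tok in enumerate(tokens):
        cur.append(i)
        if tok in boundaries:
            segments.append(cur)
            cur = []
    if cur:
        segments.append(cur)
    return segments

def evict(tokens, attn, budget):
    """Keep the `budget` most important KV entries. Each token's score is
    its attention mass scaled by its segment's mean importance, so tokens
    in salient segments are favored as a group."""
    scored = []
    for seg in semantic_segments(tokens):
        seg_mean = sum(attn[i] for i in seg) / len(seg)
        scored.extend((attn[i] * seg_mean, i) for i in seg)
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:budget])

# Toy example: the "needle" tokens sit in a high-attention segment.
tokens = ["the", "key", "is", "42", ".", "filler", "text", "."]
attn = [0.05, 0.30, 0.10, 0.40, 0.05, 0.03, 0.04, 0.03]
print(evict(tokens, attn, budget=3))  # → [1, 2, 3]
```

Note how segment-level scaling keeps the low-attention token "is" (index 2) because it lies inside the salient segment, whereas purely token-level scoring would evict it and fragment the phrase.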

๐Ÿ“ Abstract
The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.
Problem

Research questions and friction points this paper is trying to address.

Addresses KV cache memory bottleneck in long-context LLM inference
Balances semantic coherence with memory efficiency in cache compression
Adaptively optimizes compression block sizes to preserve semantic integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic segmentation aligns compression with linguistic structures
Segment-guided token scoring refines importance estimation
Budget-driven search adaptively determines optimal block size
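The third bullet, the budget-driven block-size search, can be made concrete with a small sketch. Everything here is a hypothetical reading of the idea: the candidate sizes, the retained-importance objective, and the tie-break favoring larger blocks (as a proxy for semantic integrity) are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical budget-driven block-size search: for one segment, try each
# candidate block size, aggregate token scores into contiguous blocks, and
# pick the size that retains the most importance under the segment's cache
# budget. Ties go to the larger block size, preferring coarser, more
# semantically intact blocks. Illustrative only, not SABlock's actual code.

def best_block_size(scores, seg_budget, candidates=(1, 2, 4, 8)):
    """Return the block size maximizing importance kept under the budget."""
    best_size, best_kept = candidates[0], -1.0
    for b in candidates:
        # Aggregate token scores into contiguous blocks of size b.
        blocks = [sum(scores[i:i + b]) for i in range(0, len(scores), b)]
        blocks.sort(reverse=True)
        # Only whole blocks are kept; the budget caps their total tokens.
        kept = sum(blocks[:seg_budget // b])
        if kept >= best_kept:  # >= : larger sizes win ties
            best_size, best_kept = b, kept
    return best_size

# Importance concentrated in one adjacent token pair: with a 2-entry
# budget, block size 2 keeps both tokens as one coherent block.
scores = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print(best_block_size(scores, seg_budget=2))  # → 2
```

With a 1-entry budget the same search falls back to block size 1, since no larger block fits: granularity adapts per segment to the budget, which is the behavior the bullet describes.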
Jinhan Chen
School of Computer Science and Technology, University of Science and Technology of China
Jianchun Liu
University of Science and Technology of China
Edge Computing · Federated Learning · Model Inference
Hongli Xu
University of Science and Technology of China
Software Defined Network · Cooperative Communication · Sensor Networks
Xianjun Gao
School of Computer Science and Technology, University of Science and Technology of China
Shilong Wang
School of Computer Science and Technology, University of Science and Technology of China