CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

📅 2026-02-24
🤖 AI Summary
This work addresses the high latency of long-context large language model inference caused by the growing key-value (KV) cache, a challenge that existing pruning methods fail to resolve because they lack contextual awareness and struggle to balance generation quality with acceleration. To overcome this, the authors propose CHESS, a co-designed algorithmic and systems-level KV cache management framework that introduces, for the first time, a context-aware hierarchical semantic selection mechanism. This mechanism dynamically reconstructs the critical context required at each decoding step while leveraging coarse-grained cache operations to minimize data-movement overhead. Experiments demonstrate that CHESS achieves generation quality exceeding full-cache baselines while retaining only 1% of the KV cache, and improves inference throughput by up to 4.56×, significantly outperforming current state-of-the-art approaches.

📝 Abstract
Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-design KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, coarse-granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency, stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.
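The coarse-grained selection described in the abstract can be sketched as follows. This is an illustrative approximation only, not CHESS's actual policy: the block size, the max-dot-product scoring rule, and the keep ratio are all assumptions chosen for clarity. The key point it demonstrates is that scoring and keeping whole contiguous blocks of cached keys, rather than individual tokens, avoids irregular gather operations over the KV cache.

```python
import numpy as np

def select_kv_blocks(query, keys, block_size=16, keep_ratio=0.01):
    """Illustrative coarse-grained KV-cache selection (not the paper's
    actual algorithm): score whole blocks of cached keys against the
    current decoding query and keep only the top-scoring blocks.

    query: (d,) query vector for the current decoding step
    keys:  (n, d) cached key vectors
    Returns sorted indices of the retained blocks.
    """
    n, d = keys.shape
    n_blocks = (n + block_size - 1) // block_size
    # Pad with zeros so the cache splits evenly into fixed-size blocks.
    pad = n_blocks * block_size - n
    padded = np.vstack([keys, np.zeros((pad, d))]) if pad else keys
    blocks = padded.reshape(n_blocks, block_size, d)
    # One coarse score per block: the best query-key dot product inside it.
    scores = np.einsum("bkd,d->bk", blocks, query).max(axis=1)
    # Retain roughly keep_ratio of the blocks (at least one).
    k = max(1, int(np.ceil(keep_ratio * n_blocks)))
    # Keep indices sorted so retained blocks stay in cache order.
    return np.sort(np.argsort(scores)[-k:])
```

Because selection happens at block granularity, the retained cache is a small set of contiguous slices rather than a scattered index set, which is what lets coarse-grained schemes turn theoretical sparsity into practical speedup.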
Problem

Research questions and friction points this paper is trying to address.

long-context LLM
KV cache
context-aware pruning
inference latency
semantic selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware
hierarchical selection
KV cache pruning
algorithm-system co-design
long-context LLM
Chao Fei
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Guozhong Li
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Chenxi Liu
Hong Kong Baptist University
Machine Learning, Causal Discovery, Causal Inference, AI for Science

Panos Kalnis
Professor of Computer Science, King Abdullah University of Science and Technology (KAUST)
Big Data, Cloud Computing, Parallel Systems, Graphs, Privacy