SCBench: A KV Cache-Centric Analysis of Long-Context Methods

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing long-context large language models (LLMs) face computational and memory bottlenecks across the full KV cache lifecycle (generation, compression, retrieval, and loading), while mainstream benchmarks neglect real-world cache-reuse practices. To address this gap, we propose SCBench, the first KV-cache-centric benchmark for long-context inference. It supports shared-context, multi-task evaluation, covering string retrieval, semantic retrieval, global information aggregation, and multi-turn interaction, via a dual-mode test suite. Integrating eight categories of optimization methods into vLLM and SGLang, we evaluate them across eight long-context LLMs. Results show that sub-linear-memory schemes degrade significantly in multi-turn settings; that sparse encoding with O(n) memory and sub-quadratic prefilling is notably robust; that dynamic sparsity yields more expressive KV caches than static patterns; that layer-level sparsity in hybrid architectures achieves a strong trade-off between low memory overhead and accuracy; and that attention distributions shift in long-generation scenarios.
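The cache-reuse practice the summary refers to can be illustrated with a toy sketch (hypothetical class and names, not the paper's or any framework's code): a prefix cache keyed on the shared context, so follow-up turns only prefill their new query tokens.

```python
# Toy sketch of KV-cache prefix reuse across shared-context requests.
# Simplifying assumption: the "KV cache" is just a list of processed
# tokens, and "prefill" cost is the number of tokens newly encoded.

class PrefixKVCache:
    def __init__(self):
        self._store = {}          # shared-context key -> cached token list
        self.prefill_tokens = 0   # total tokens actually (re)encoded

    def encode(self, context, query):
        cached = self._store.get(context)
        if cached is None:
            cached = context.split()         # encode the shared context once
            self.prefill_tokens += len(cached)
            self._store[context] = cached
        new = query.split()                  # only the new turn is prefilled
        self.prefill_tokens += len(new)
        return cached + new

cache = PrefixKVCache()
doc = "long shared document tokens " * 2    # 8-token stand-in for a document
cache.encode(doc, "question one")
cache.encode(doc, "question two")           # reuses the cached shared prefix
```

With reuse, the second request prefills only its 2 query tokens instead of re-encoding the 8-token shared document, which is the multi-turn setting SCBench is built to stress.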

📝 Abstract
Long-context LLMs have enabled numerous downstream applications but have also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single-request scenarios, neglecting the KV cache's full lifecycle in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
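One family the abstract lists, KV cache dropping, can be sketched as a simple eviction policy (a toy illustration with hypothetical parameter names, not the benchmark's or any cited method's implementation): keep a few initial "sink" positions plus a recent window, and drop everything in between.

```python
def evict_kv(cache_positions, n_sink=4, window=8):
    """Toy KV-cache dropping policy: keep the first n_sink positions
    plus the most recent `window` positions; drop the middle.

    This caps cache memory at n_sink + window entries regardless of
    sequence length, which is the sub-O(n) memory regime whose
    multi-turn weaknesses SCBench measures.
    """
    if len(cache_positions) <= n_sink + window:
        return list(cache_positions)       # nothing to evict yet
    return list(cache_positions[:n_sink]) + list(cache_positions[-window:])

kept = evict_kv(list(range(100)), n_sink=4, window=8)
# positions 0-3 (sinks) and 92-99 (recent window) survive
```

The sketch makes the benchmark's finding concrete: once the middle of the context is evicted, a later turn that queries those dropped positions cannot recover them, whereas an O(n)-memory sparse-encoding method still holds the full cache.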
Problem

Research questions and friction points this paper is trying to address.

Evaluates long-context LLMs' KV cache lifecycle efficiency.
Addresses gaps in existing benchmarks for real-world use.
Analyzes KV cache generation, compression, retrieval, and loading.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SCBench evaluates the KV cache lifecycle comprehensively.
Focuses on KV cache generation, compression, retrieval, and loading.
Analyzes long-context methods on shared-context tasks.