Long-Context Inference with Retrieval-Augmented Speculative Decoding

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational inefficiency caused by KV cache overhead in long-context large language model (LLM) inference, this paper proposes Retrieval-Augmented Speculative Decoding (RAPID). Methodologically, RAPID introduces two key innovations: (1) a novel RAG-based drafter paradigm that dynamically compresses draft contexts via retrieval, enabling more powerful models to serve as the draft LLM; and (2) an inference-time knowledge transfer mechanism that calibrates the target output distribution to mitigate the performance degradation speculative decoding suffers under long contexts. Evaluated on InfiniteBench, RAPID improves the accuracy of LLaMA-3.1-8B from 39.33 to 42.83 while accelerating inference by over 2×. Crucially, it maintains robustness and generation quality even at context lengths exceeding 32K tokens.
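The core mechanism the summary describes is standard speculative decoding with the draft model swapped for a RAG drafter that conditions only on a short retrieved context. The sketch below illustrates that loop with toy uniform/weighted distributions standing in for real models; `draft_dist`, `target_dist`, and the vocabulary are illustrative assumptions, not the paper's implementation. The accept/reject rule shown is the standard speculative-sampling criterion.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_dist(ctx):
    # Stand-in for the RAG drafter: conditions on a short retrieved
    # context, so it is cheap to run but may be miscalibrated.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_dist(ctx):
    # Stand-in for the long-context target LLM: expensive, authoritative.
    probs = {t: 1.0 for t in VOCAB}
    probs["mat"] = 3.0  # the target prefers "mat" in this toy example
    z = sum(probs.values())
    return {t: p / z for t, p in probs.items()}

def speculative_step(ctx, gamma=4):
    """One round of speculative decoding: the drafter proposes up to
    `gamma` tokens; the target verifies each with min(1, p/q)."""
    proposed = []
    for _ in range(gamma):
        q = draft_dist(ctx + proposed)
        proposed.append(random.choices(list(q), weights=list(q.values()))[0])
    accepted = []
    for tok in proposed:
        p = target_dist(ctx + accepted)
        q = draft_dist(ctx + accepted)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # target agrees often enough: keep token
        else:
            # On rejection, resample from the residual distribution
            # max(p - q, 0), renormalized, and stop this round.
            resid = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            z = sum(resid.values()) or 1.0
            resid = {t: r / z for t, r in resid.items()}
            accepted.append(
                random.choices(list(resid), weights=list(resid.values()))[0])
            break
    return accepted

print(speculative_step(["the", "cat", "sat"]))
```

Because the drafter sees only the compressed retrieval context, its KV cache stays small regardless of the full document length, which is what lets a same-scale or larger model remain a viable drafter.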

📝 Abstract
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While speculative decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We present Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG both to accelerate and to enhance generation quality in long-context inference. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm in which same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer dynamic that enriches the target distribution with RAG. Extensive experiments on LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both approaches, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2× speedups. Our analyses reveal that RAPID achieves robust acceleration beyond 32K context length and demonstrates superior generation quality in real-world applications.
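The abstract's "inference-time knowledge transfer dynamic" enriches the target distribution using the RAG drafter. The paper's exact rule is not given here, so the sketch below shows only the simplest plausible form of such a calibration: interpolating the two next-token distributions and renormalizing. The mixing weight `lam` and the function itself are assumptions for illustration.

```python
def transfer(p_target, q_draft, lam=0.3):
    """Hypothetical knowledge-transfer step: blend the long-context
    target's next-token distribution with the RAG drafter's, then
    renormalize. RAPID's actual calibration rule may differ."""
    support = set(p_target) | set(q_draft)
    mixed = {t: (1 - lam) * p_target.get(t, 0.0) + lam * q_draft.get(t, 0.0)
             for t in support}
    z = sum(mixed.values())
    return {t: v / z for t, v in mixed.items()}

# Toy example: the drafter's retrieval evidence boosts token "a".
p = {"a": 0.5, "b": 0.5}
q = {"a": 1.0}
print(transfer(p, q, lam=0.5))  # "a" now outweighs "b"
```

The intuition, per the abstract, is that a stronger drafter grounded in retrieved evidence can correct the target rather than merely accelerate it, which is why larger models are worth using as drafters.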
Problem

Research questions and friction points this paper is trying to address.

Efficient long-context inference in LLMs
Reducing KV cache computational overhead
Enhancing generation quality with RAG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Speculative Decoding
RAG drafter for long-context
Inference-time knowledge transfer dynamic