🤖 AI Summary
To address the dual bottlenecks of incomplete retrieval in Retrieval-Augmented Generation (RAG) and the high computational overhead of long-context language models for external knowledge injection, this paper proposes a task-aware key-value (KV) cache compression method. Departing from conventional RAG and indiscriminate compression paradigms, the approach employs a learnable attention module, task-driven sparsification, and knowledge aggregation to dynamically compress multi-document knowledge into compact, task-relevant, and semantically faithful representations in a zero- or few-shot setting. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate while reducing inference latency from 0.43s to 0.16s, and it significantly outperforms RAG on broad-domain knowledge reasoning tasks. The core contribution is the first formulation of a task-aware KV cache compression paradigm, unifying efficiency and semantic consistency in knowledge injection.
📝 Abstract
Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad-knowledge tasks.
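To make the core idea concrete, here is a minimal, hedged sketch of task-aware KV cache sparsification: cached key/value pairs are scored by attention against a task/query vector, and only the top-scoring fraction is kept. The function name, shapes, and the fixed `keep_ratio` are illustrative assumptions; the paper's learnable attention module and knowledge-aggregation steps are not reproduced here.

```python
import numpy as np

def task_aware_kv_compress(keys, values, query, keep_ratio=1 / 30):
    """Illustrative sketch (not the paper's method): keep the KV entries
    most attended to by a task/query vector, at roughly a 30x ratio."""
    d = keys.shape[-1]
    # Scaled dot-product attention scores of the query against cached keys.
    scores = keys @ query / np.sqrt(d)
    k = max(1, int(len(keys) * keep_ratio))
    # Task-driven sparsification: retain the k most task-relevant entries,
    # preserving their original sequence order.
    idx = np.sort(np.argsort(scores)[-k:])
    return keys[idx], values[idx]

# Example: a 300-token cache compressed ~30x down to 10 entries.
rng = np.random.default_rng(0)
K = rng.normal(size=(300, 64))
V = rng.normal(size=(300, 64))
q = rng.normal(size=64)
Ck, Cv = task_aware_kv_compress(K, V, q)
print(Ck.shape)  # (10, 64)
```

In this toy form, the compressed cache could be fed to the decoder in place of the full multi-document cache, which is where the latency reduction reported above would come from.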