Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the dual bottlenecks of incomplete retrieval in Retrieval-Augmented Generation (RAG) and high computational overhead in long-context language models for external knowledge injection, this paper proposes a task-aware key-value (KV) cache compression method. Departing from conventional RAG and indiscriminate compression paradigms, our approach employs a learnable attention module, task-driven sparsification, and knowledge aggregation to dynamically compress multi-document knowledge into compact, task-relevant, and semantically faithful representations—requiring zero or few examples. On LongBench v2, it improves accuracy by 7 percentage points over RAG, achieves a 30× compression ratio, and reduces inference latency from 0.43s to 0.16s; it also significantly outperforms RAG on broad-domain knowledge reasoning tasks. The core contribution is the first formulation of a task-aware KV cache compression paradigm, unifying efficiency and semantic consistency in knowledge injection.

Technology Category

Application Category

📝 Abstract
Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhances LLMs by compressing external knowledge efficiently
Improves accuracy and reduces latency in knowledge reasoning tasks
Outperforms RAG and task-agnostic compression methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-aware KV cache compression for LLMs
Zero- or few-shot external knowledge compression
Improved accuracy and reduced inference latency
🔎 Similar Papers
No similar papers found.