InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inference inefficiency of long-context retrieval-augmented generation (RAG), which stems from prefilling large amounts of retrieved content and from existing KV cache recomputation methods that ignore how information actually propagates through the context. The study introduces an information-flow perspective into KV cache optimization for the first time. It proposes a query-guided attention norm to identify critical tokens, and it leverages the geometric consistency of RoPE positional encoding to reconstruct global positional information for independently precomputed chunks. Together these enable information-flow-aware chunk reordering and efficient selective recomputation. Evaluated on both large language models and vision-language models, the method significantly improves inference efficiency and accuracy in long-context question answering under comparable computational overhead.
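The query-guided token-selection step in the summary can be sketched as follows. This is a minimal single-head illustration, not the paper's implementation: the function name, shapes, and the specific scoring rule (attention weight times value-vector norm, in the spirit of norm-based attention analysis) are assumptions for clarity.

```python
import numpy as np

def select_tokens_by_attention_norm(q, K, V, top_k):
    """Score each context token by the norm of its attention-weighted
    value contribution to the query, then keep the top-k indices.

    q: (d,) query vector; K, V: (n_ctx, d) cached keys/values.
    Single-head, unbatched sketch -- shapes are illustrative only.
    """
    d = q.shape[-1]
    logits = K @ q / np.sqrt(d)              # (n_ctx,) scaled dot-product scores
    attn = np.exp(logits - logits.max())     # numerically stable softmax
    attn /= attn.sum()
    # Norm-based contribution of token j: a_j * ||v_j||
    contrib = attn * np.linalg.norm(V, axis=-1)
    return np.argsort(contrib)[::-1][:top_k]
```

Tokens selected this way are both attended to by the query and carry large value vectors, so recomputing only their KV entries is more likely to restore information that actually reaches the generated output.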

📝 Abstract
Retrieval-augmented generation (RAG) for long-context question answering is bottlenecked by inference-time prefilling over large retrieved contexts. A common strategy is to precompute key-value (KV) caches for individual documents and selectively recompute a small subset of tokens to restore global causal dependencies, but existing methods rely on heuristics or representation discrepancies without modeling whether selected tokens can effectively influence generation. We cast selective KV recomputation as an information flow problem and show that a simple attention-norm signal from the query reliably identifies tokens that are both semantically relevant and structurally positioned to propagate information, when computed under an inference-consistent RoPE geometry. We therefore reconstruct global positional assignments for retrieved chunks and introduce an information-flow-guided chunk reordering strategy. Experiments on LLM and VLM benchmarks demonstrate consistent gains over prior methods under comparable efficiency budgets.
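The "inference-consistent RoPE geometry" in the abstract rests on RoPE's rotational composition property: rotating a key cached at local position p by an additional offset Δ yields exactly the key RoPE would produce at global position p + Δ, so per-chunk caches can be re-anchored anywhere in the reordered context without recomputing attention. A minimal numpy sketch under assumed names and the split-half pair layout (real implementations vary in dimension layout and batching):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply RoPE rotation at the given positions to vectors x.

    x: (n, d) with d even; positions: (n,). Split-half pair layout
    (dims [0:d/2] paired with [d/2:d]) is an illustrative assumption.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # (half,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def shift_cached_keys(k_cached, chunk_start):
    """Re-anchor keys cached at local positions 0..n-1 to global
    positions chunk_start..chunk_start+n-1.

    Because 2D rotations compose, rotate(rotate(k, p), delta) equals
    rotate(k, p + delta), so a single extra rotation by the chunk's
    global start offset suffices.
    """
    n = len(k_cached)
    return rope_rotate(k_cached, np.full(n, float(chunk_start)))
```

This composition property is what lets the method assign fresh global positions to reordered chunks cheaply; only the selected critical tokens then need full KV recomputation to restore cross-chunk causal dependencies.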
Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented generation
KV cache recomputation
long-context
information flow
causal dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

information flow
KV cache recomputation
retrieval-augmented generation
RoPE geometry
long-context QA