KaVa: Latent Reasoning via Compressed KV-Cache Distillation

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) rely on explicit chain-of-thought (CoT) prompting for multi-step reasoning, but verbose traces incur high computational overhead and carry redundant, stylistic content. Existing implicit reasoning methods lack effective supervision and underperform on complex natural-language reasoning tasks. To address this, we propose KaVa, the first framework to use a compressed key-value (KV) cache as a continuous latent supervision signal. KaVa employs self-distillation to align the student model's layer-wise KV trajectories with those of a CoT teacher, transferring knowledge end-to-end from explicit CoT to compact implicit reasoning. The method integrates KV-cache compression, continuous latent-token modeling, and self-distillation, and requires no manual annotation of reasoning steps. Experiments across multiple reasoning benchmarks show that KaVa consistently outperforms strong latent baselines, degrades far less when moving from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency.

📝 Abstract
Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within a compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs of verbose reasoning traces
Providing supervision for latent reasoning without token correspondence
Aligning compressed KV-cache knowledge with latent reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills compressed KV-cache into latent reasoning
Aligns stepwise KV trajectories via latent tokens
Uses abstract KV-cache as supervision signal
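
The alignment idea in these bullets can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `compress_kv` mean-pooling step and the plain MSE objective are simplifying assumptions standing in for KaVa's actual KV-cache compression scheme and distillation loss; the teacher's KV trajectory comes from a full CoT trace, the student's from a short sequence of continuous latent tokens.

```python
import numpy as np

def compress_kv(kv, num_slots):
    """Mean-pool a per-layer KV trajectory [T, d] down to [num_slots, d].
    A stand-in (assumed) for the paper's actual KV-cache compression."""
    return np.stack([chunk.mean(axis=0) for chunk in np.array_split(kv, num_slots)])

def kv_distill_loss(teacher_kv_layers, student_kv_layers):
    """Align the student's latent-token KV states, layer by layer,
    with the compressed teacher KV trajectory (here: simple MSE)."""
    total = 0.0
    for t_kv, s_kv in zip(teacher_kv_layers, student_kv_layers):
        target = compress_kv(t_kv, num_slots=s_kv.shape[0])
        total += np.mean((s_kv - target) ** 2)
    return total / len(teacher_kv_layers)

# Toy example: a 2-layer teacher with a 32-step CoT trace,
# and a student that reasons in 8 continuous latent tokens.
teacher = [np.random.randn(32, 16) for _ in range(2)]
student = [np.random.randn(8, 16) for _ in range(2)]
print(kv_distill_loss(teacher, student))  # scalar alignment loss
```

Because the compressed targets carry no token-level correspondence, the supervision lands on the student's latent KV states directly, which is what lets the student skip emitting an explicit trace at inference time.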
👥 Authors
Anna Kuzina
Senior Researcher, Qualcomm
Maciej Pioro
IDEAS NCBR / IPPT PAN
Paul N. Whatmough
Qualcomm AI Research
Babak Ehteshami Bejnordi
Qualcomm AI Research