Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a method that systematically leverages the KV cache from large language model (LLM) inference as a lightweight contextual representation for downstream tasks, without incurring additional computation or storage overhead. By reusing the KV cache directly, the approach avoids recomputing or storing full hidden states, enabling efficient inference in two distinct scenarios: Chain-of-Embedding and fast/slow thinking switching. Experiments on Llama-3.1-8B-Instruct, Qwen2-7B-Instruct, Qwen3-8B, and DeepSeek-R1-Distill-Qwen-14B show that the method matches or exceeds dedicated embedding models on Chain-of-Embedding tasks, while reducing generated tokens by up to 5.7× in fast/slow thinking switching with negligible accuracy degradation.
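The core idea of pooling a KV cache into a fixed-size embedding can be sketched as follows. This is a minimal NumPy mock, not the paper's code: the cache shapes, the `kv_cache_embedding` helper, and the choice of mean-pooling keys and values from one layer are all illustrative assumptions.

```python
import numpy as np

def kv_cache_embedding(past_key_values, layer=-1):
    """Pool a (mock) KV cache into a fixed-size vector.

    past_key_values: list over layers of (keys, values) pairs, each of
    shape (num_heads, seq_len, head_dim), mimicking a transformer KV
    cache. Mean-pool over heads and sequence positions, then concatenate
    the pooled keys and values into a 2 * head_dim embedding. Because
    the cache already exists after decoding, this costs no extra forward
    pass.
    """
    keys, values = past_key_values[layer]
    k = keys.mean(axis=(0, 1))     # (head_dim,)
    v = values.mean(axis=(0, 1))   # (head_dim,)
    return np.concatenate([k, v])  # (2 * head_dim,)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock cache: 2 layers, 4 heads, 6 cached tokens, head_dim 8.
rng = np.random.default_rng(0)
cache = [(rng.normal(size=(4, 6, 8)), rng.normal(size=(4, 6, 8)))
         for _ in range(2)]
emb = kv_cache_embedding(cache)
print(emb.shape)  # (16,)
```

With a real model, `past_key_values` would come straight from the decoder's cache, so the embedding is a byproduct of generation rather than a separate encoding pass.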

📝 Abstract
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) **Chain-of-Embedding**, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) **Fast/Slow Thinking Switching**, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.
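One plausible reading of the fast/slow switching application is a gate that scores the KV-derived embedding and routes easy queries to direct ("fast") decoding while reserving full chain-of-thought ("slow") reasoning for hard ones. The linear probe, sigmoid score, and threshold below are hypothetical illustrations, not the paper's actual mechanism.

```python
import numpy as np

def choose_mode(kv_embedding, probe_w, probe_b, threshold=0.5):
    """Hypothetical fast/slow gate over a KV-derived embedding.

    A linear probe followed by a sigmoid yields a confidence score in
    (0, 1); scores at or above the threshold select "fast" decoding,
    and lower scores fall back to "slow" long-form reasoning.
    """
    score = 1.0 / (1.0 + np.exp(-(kv_embedding @ probe_w + probe_b)))
    return "fast" if score >= threshold else "slow"

# Toy usage with a random embedding and probe weights.
rng = np.random.default_rng(1)
emb = rng.normal(size=16)
w = rng.normal(size=16)
mode = choose_mode(emb, w, 0.0)
print(mode)
```

Since the gate only reads the existing cache, the routing decision adds essentially no cost on top of normal decoding, which is what makes the reported token savings possible without a second encoder.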
Problem

Research questions and friction points this paper is trying to address.

KV cache
representation reuse
LLM inference
sampling
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache
representation reuse
Chain-of-Embedding
adaptive reasoning
efficient LLM inference