ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

📅 2026-04-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the substantial memory and computational overhead that large language models incur during long-output generation because they retain the full key-value (KV) cache. To mitigate this, the authors propose a multi-granularity KV retrieval mechanism that introduces, for the first time, adaptive compression of reasoning steps and dynamic KV cache selection into long-output scenarios. The approach compresses historical reasoning steps into summaries that serve as coarse-grained indices, then dynamically "zooms in" on relevant segments during decoding to retrieve fine-grained details, so that summary-level and precise information are retrieved together. By avoiding full-cache attention at each step, this hierarchical strategy achieves performance competitive with baselines on mathematical and complex reasoning tasks while reducing inference memory consumption by more than 4×.
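
To make the memory pressure concrete, the sketch below estimates how a full KV cache grows with the number of cached tokens, using the usual accounting of one key and one value vector per layer and per KV head. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte elements) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to hold the full KV cache for `num_tokens` tokens.

    All default dimensions are hypothetical, not the paper's configuration.
    """
    # The factor 2 accounts for storing both a key and a value per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


if __name__ == "__main__":
    # The footprint is linear in the number of generated tokens.
    for n in (1_024, 8_192, 32_768):
        print(f"{n:>6} tokens: {kv_cache_bytes(n) / 2**30:.3f} GiB")
```

Under these assumed dimensions each cached token costs 128 KiB, so a 32,768-token reasoning trace occupies 4 GiB for a single sequence. Any policy that attends to, and keeps resident, only a subset of those entries shrinks the footprint proportionally.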

📝 Abstract
Large language models (LLMs) have shown strong performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.
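
The coarse-to-fine selection described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. It is not the authors' implementation: the `Segment` container, the dot-product scoring of summary keys, the fixed `top_k` rule, and the choice to keep every summary in the attended set are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container: one past thought, its summary, and its token KVs."""
    summary_key: list
    summary_value: list
    detail_keys: list
    detail_values: list


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_segments(query, segments, top_k):
    """Coarse pass: rank segments by query . summary_key and keep top_k indices."""
    ranked = sorted(range(len(segments)),
                    key=lambda i: dot(query, segments[i].summary_key),
                    reverse=True)
    return sorted(ranked[:top_k])


def zoom_attention(query, segments, top_k):
    """Attend over all summaries plus the details of the selected segments only.

    Assumption: summaries always stay in the attended set; the abstract does
    not specify this level of detail.
    """
    chosen = select_segments(query, segments, top_k)
    keys = [s.summary_key for s in segments]
    values = [s.summary_value for s in segments]
    for i in chosen:  # fine pass: "zoom in" on the chosen thoughts
        keys.extend(segments[i].detail_keys)
        values.extend(segments[i].detail_values)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, chosen, len(keys)
```

In this sketch the number of attended entries is the number of summaries plus the detail tokens of the `top_k` selected segments, rather than every token generated so far, which is where the memory saving would come from under these assumptions.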

Problem

Research questions and friction points this paper is trying to address.

KV cache
memory efficiency
long output generation
large language models
reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-granularity retrieval
KV cache compression
memory-efficient decoding
adaptive summarization
hierarchical attention