ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe memory bottleneck in tree-based reasoning frameworks such as Tree-of-Thoughts (ToT), which arises from retaining the key-value (KV) caches of many intermediate reasoning states as search depth and breadth scale. To mitigate this, the authors propose ArborKV, a structure-aware KV cache management mechanism in which a lightweight value estimator guides cache allocation across the tree. The approach combines token-level extractive eviction with lazy rehydration, substantially reducing memory overhead while preserving the ability to backtrack during reasoning. Evaluated on ToT-style reasoning benchmarks, the method achieves up to a roughly 4× reduction in peak KV cache memory relative to full retention while keeping accuracy close to that of full retention. It thereby enables larger tree search configurations that would otherwise run out of memory under a fixed device budget.
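To make the bottleneck concrete, the following back-of-the-envelope sketch estimates how many tokens a full-retention policy would hold for a complete search tree, and how many bytes each token costs. The function names, the complete-tree shape, and the model dimensions in the usage note are illustrative assumptions, not figures from the paper. The sketch also assumes that sibling branches share their ancestors' KV entries, so each node stores only its own tokens.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache needed for one token.

    The leading factor 2 accounts for storing both a key and a value
    vector in every layer and every KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def full_retention_tokens(branching: int, depth: int,
                          tokens_per_thought: int) -> int:
    """Tokens held if every node of a complete tree keeps its KV states.

    Assumes (for illustration only) a complete tree with `branching`
    children per node, `depth` levels below the prompt, and a fixed
    number of tokens per thought. Prompt tokens are not counted.
    """
    nodes = sum(branching ** level for level in range(1, depth + 1))
    return nodes * tokens_per_thought
```

For example, with hypothetical dimensions of 32 layers, 8 KV heads, head size 128, and 16-bit storage, each token costs 131,072 bytes; a complete tree with branching 3, depth 4, and 100 tokens per thought holds 12,000 thought tokens, or about 1.5 GiB of KV states. Real ToT runs prune as they go, so this is an upper bound on one particular tree shape rather than a measured figure.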

📝 Abstract

Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the Key-Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
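The three ingredients named above (tree-aware allocation, token-extractive eviction, lazy rehydration) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Node` fields, the `rebalance` function, the proportional budget split, and the per-token `scores` are all hypothetical, and the KV tensors are simulated by a list of resident token positions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    parent: Optional[int]
    tokens: List[int]      # token ids; always kept so KV can be recomputed
    scores: List[float]    # hypothetical per-token importance scores
    value: float           # hypothetical output of a value estimator
    kept: List[int] = field(default_factory=list)  # resident KV positions

    def __post_init__(self) -> None:
        # A freshly decoded node starts with all of its KV entries resident.
        if not self.kept:
            self.kept = list(range(len(self.tokens)))


def path_to_root(tree: Dict[int, Node], node_id: int) -> List[int]:
    """Return the ids of a node and all of its ancestors."""
    path: List[int] = []
    cur: Optional[int] = node_id
    while cur is not None:
        path.append(cur)
        cur = tree[cur].parent
    return path


def rebalance(tree: Dict[int, Node], active_id: int, total_budget: int) -> None:
    """Re-allocate a token budget when the search moves to `active_id`."""
    active = set(path_to_root(tree, active_id))

    # Lazy rehydration: full KV is restored only for nodes on the active
    # path. A real system would recompute it from the stored token ids.
    for nid in active:
        tree[nid].kept = list(range(len(tree[nid].tokens)))

    used = sum(len(tree[nid].tokens) for nid in active)
    remaining = max(total_budget - used, 0)

    # Tree-aware allocation (assumed rule): split what is left among
    # inactive nodes in proportion to their estimated value.
    inactive = [node for nid, node in tree.items() if nid not in active]
    total_value = sum(node.value for node in inactive)
    for node in inactive:
        share = node.value / total_value if total_value > 0 else 0.0
        budget = min(len(node.kept), int(remaining * share))
        # Token-extractive eviction: keep an unmodified subset of the
        # positions that are still resident, chosen by score.
        best = sorted(node.kept, key=lambda i: node.scores[i], reverse=True)
        node.kept = sorted(best[:budget])


def resident_tokens(tree: Dict[int, Node]) -> int:
    """Total number of token positions whose KV is currently resident."""
    return sum(len(node.kept) for node in tree.values())
```

Calling `rebalance(tree, active_id, total_budget)` each time the search expands or backtracks keeps the active branch intact while shrinking inactive subtrees. An evicted node is never grown back in place; it regains its full KV only when it lands on the active path again, which is what keeps backtracking possible in this toy model.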

Problem

Research questions and friction points this paper is trying to address.

KV cache

Tree-of-Thoughts

memory bottleneck

LLM reasoning

tree-based search

Innovation

Methods, ideas, or system contributions that make the work stand out.

structure-aware KV cache

Tree-of-Thoughts

memory-efficient LLM inference