ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

📅 2026-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe memory bottleneck in tree-based reasoning frameworks such as Tree-of-Thoughts (ToT), which arises from retaining extensive intermediate key-value (KV) caches as search depth and breadth scale. To mitigate this, the authors propose a structure-aware KV cache management mechanism that uses a lightweight value estimator to guide cache allocation. The approach combines token-level extractive eviction with a lazy rehydration strategy, substantially reducing memory overhead while preserving the ability to backtrack during reasoning. Evaluated on ToT reasoning benchmarks, the method achieves up to a 4× reduction in peak KV cache memory usage compared to full retention, with minimal degradation in reasoning accuracy. It thereby enables larger tree-search configurations that would otherwise exceed the available memory.
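To give the "up to 4×" figure some scale, the following is a back-of-the-envelope sketch of KV cache size under full retention. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the search size (100 retained nodes of 256 tokens each) are illustrative assumptions, not values from the paper, and prefix sharing between nodes is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache stored for a single token position."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model shape (assumption, not from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)

# Hypothetical search state: 100 retained tree nodes, 256 tokens each.
retained_tokens = 100 * 256

full_bytes = per_token * retained_tokens   # full retention
reduced_bytes = full_bytes / 4             # the reported best-case 4x reduction

print(f"per token:      {per_token} bytes")
print(f"full retention: {full_bytes / GIB:.3f} GiB")
print(f"after 4x cut:   {reduced_bytes / GIB:.3f} GiB")
```

Under these assumptions the cache grows linearly with the number of retained tokens, so widening or deepening the search raises peak memory proportionally unless inactive branches are compressed.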
📝 Abstract
Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the key-value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to roughly 4× peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
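The ideas named in the abstract (tree-aware allocation, token-extractive eviction, lazy rehydration) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the value estimator, the allocation rule, or how rehydration restores evicted entries. Every name here (`Node`, `allocate_budgets`, `evict`, `rehydrate`), the per-token `scores`, and the proportional-to-value split are hypothetical, and the "cache" is only a set of token indices rather than real tensors.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One reasoning state in the search tree (hypothetical structure)."""
    node_id: int
    parent: Optional["Node"]
    tokens: List[str]
    scores: List[float]          # assumed per-token importance scores
    value: float = 0.0           # assumed output of a value estimator
    kept: Set[int] = field(default_factory=set)

    def __post_init__(self):
        # Every token's KV entry starts resident.
        self.kept = set(range(len(self.tokens)))


def path_ids(node: Optional[Node]) -> Set[int]:
    """IDs of a node and all of its ancestors."""
    ids = set()
    while node is not None:
        ids.add(node.node_id)
        node = node.parent
    return ids


def allocate_budgets(nodes: List[Node], active: Node, total: int) -> Dict[int, int]:
    """Active path keeps everything; the rest is split by estimated value."""
    protected = path_ids(active)
    budgets = {n.node_id: len(n.tokens) for n in nodes if n.node_id in protected}
    remaining = max(0, total - sum(budgets.values()))
    inactive = [n for n in nodes if n.node_id not in protected]
    total_value = sum(n.value for n in inactive) or 1.0
    for n in inactive:
        share = int(remaining * n.value / total_value)
        budgets[n.node_id] = min(len(n.tokens), share)
    return budgets


def evict(node: Node, budget: int) -> None:
    """Extractive eviction: keep a subset of the original tokens, unmodified."""
    ranked = sorted(range(len(node.tokens)),
                    key=lambda i: node.scores[i], reverse=True)
    node.kept = set(ranked[:budget])


def rehydrate(node: Node) -> List[int]:
    """On revisit, report which positions must be restored, then mark them resident."""
    missing = sorted(set(range(len(node.tokens))) - node.kept)
    node.kept = set(range(len(node.tokens)))
    return missing


# Toy tree: a root with three children; child `a` is the active branch.
root = Node(0, None, ["r0", "r1", "r2", "r3"], [0.9, 0.1, 0.5, 0.3])
a = Node(1, root, ["a0", "a1", "a2", "a3"], [0.4, 0.4, 0.4, 0.4], value=0.5)
b = Node(2, root, ["b0", "b1", "b2", "b3"], [0.2, 0.8, 0.6, 0.4], value=0.75)
c = Node(3, root, ["c0", "c1", "c2", "c3"], [0.7, 0.1, 0.2, 0.3], value=0.25)
nodes = [root, a, b, c]

budgets = allocate_budgets(nodes, active=a, total=12)
for n in nodes:
    evict(n, budgets[n.node_id])
resident = sum(len(n.kept) for n in nodes)

# Backtracking to `b`: only now are its evicted positions restored.
restored = rehydrate(b)
```

In this toy run the active path (`root`, `a`) stays fully resident, the inactive siblings share the leftover budget according to their estimated value, and the cost of restoring an evicted branch is paid only if the search actually returns to it.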
Problem

Research questions and friction points this paper is trying to address.

KV cache
Tree-of-Thoughts
memory bottleneck
LLM reasoning
tree-based search
Innovation

Methods, ideas, or system contributions that make the work stand out.

structure-aware KV cache
Tree-of-Thoughts
memory-efficient LLM inference
token-extractive eviction
lazy rehydration
Yeqiu Chen
University of Science and Technology of China
Ziyan Liu
University of Science and Technology of China
Zhenxin Huang
University of Science and Technology of China
Runquan Gui
University of Science and Technology of China
Hong Wang
University of Science and Technology of China
Lei Liu
Anhui University of Science & Technology