Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current tree-search paradigms for large language model (LLM) inference suffer from domain fragmentation and weak theoretical foundations, particularly regarding the ambiguous role of reward signals: are they transient search heuristics or persistent learning objectives?

Method: We propose the first formal unification framework that explicitly decouples search mechanisms, reward modeling, and state transitions. It rigorously distinguishes inference-time search guidance from learning-time reward modeling and establishes a modular taxonomy. By integrating tree-search algorithms, reinforcement learning principles, and LLM fine-tuning techniques, the framework jointly enables inference-time scaling and model self-improvement.

Contribution/Results: Our work provides the first principled, component-aware foundation for autonomous agents, enabling interpretable, scalable, and self-evolving systems. It clarifies conceptual boundaries, resolves foundational ambiguities in reward usage, and offers a systematic theoretical pathway toward autonomous agent development.

📝 Abstract
Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: Test-Time Scaling (TTS), which deploys on-demand computation to solve hard problems, and Self-Improvement, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal: is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the Search Mechanism, Reward Formulation, and Transition Function. We establish a formal distinction between transient Search Guidance for TTS and durable Parametric Reward Modeling for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state of the art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.
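To make the abstract's decomposition concrete, the sketch below separates the three components the paper names (Search Mechanism, Reward Formulation, Transition Function) into independent callables. The greedy expansion strategy, the string-based state, and all function names here are illustrative assumptions, not the paper's actual formalism; in practice the proposal function would be an LLM sampling candidate reasoning steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    state: str                          # e.g. a partial chain-of-thought
    reward: float = 0.0
    children: List["Node"] = field(default_factory=list)

def transition(state: str, action: str) -> str:
    """Transition Function: extend the state with one action (reasoning step)."""
    return state + " " + action

def greedy_tree_search(
    root: str,
    propose: Callable[[str], List[str]],  # candidate next steps (an LLM in practice)
    reward: Callable[[str], float],       # Reward Formulation, used as search guidance
    depth: int,
) -> Node:
    """Search Mechanism: greedy best-first expansion to a fixed depth.

    Swapping this loop for MCTS or beam search changes only the Search
    Mechanism; the reward and transition components stay untouched.
    """
    tree = Node(root)
    current = tree
    for _ in range(depth):
        candidates = [Node(transition(current.state, a)) for a in propose(current.state)]
        for c in candidates:
            c.reward = reward(c.state)    # transient score: ranks candidates, then discarded
        current.children = candidates
        current = max(candidates, key=lambda c: c.reward)
    return tree
```

Because the three components are decoupled, each can be studied or replaced in isolation, which is the modularity the taxonomy is built around.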
Problem

Research questions and friction points this paper is trying to address.

Clarifies the ambiguous role of reward signals in tree-search algorithms
Unifies test-time scaling and self-improvement approaches for LLMs
Establishes a formal distinction between search guidance and reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework deconstructs search into three core components
Formally distinguishes transient search guidance from durable reward modeling
Introduces component-centric taxonomy for systematic research roadmap
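The central distinction above, transient search guidance versus durable parametric reward modeling, can be illustrated with a minimal sketch. The function names and types are assumptions for illustration only: the same scoring function either ranks candidates at inference time (scores discarded afterward) or labels search-generated trajectories as persistent training targets for a reward model.

```python
from typing import Callable, List, Tuple

def rank_candidates(candidates: List[str], guide: Callable[[str], float]) -> List[str]:
    """Search Guidance (test-time scaling): scores order the candidates
    for this one query and are then thrown away."""
    return sorted(candidates, key=guide, reverse=True)

def collect_reward_data(
    trajectories: List[str], guide: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Parametric Reward Modeling (self-improvement): the same scores are
    retained as (trajectory, reward) supervision pairs, so the signal
    persists in model parameters after fine-tuning."""
    return [(t, guide(t)) for t in trajectories]
```

The framework's point is that both uses can share one reward formulation; what differs is whether its output is consumed transiently or stored as a learning target.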