🤖 AI Summary
To address the scarcity and high cost of ground-truth reward annotations and expert demonstrations in multi-step reasoning tasks (e.g., web navigation), this paper proposes *Self-Taught Lookahead*, a self-supervised method that constructs training signals from environment state transitions, enabling a lightweight state-value model to be trained without ground-truth rewards and used to guide tree search by language models. Its core contributions are a reward-free self-improvement mechanism for value modeling and an empirical demonstration that an open-weight 8B-parameter value model, improved with this method, can match a frontier LLM such as GPT-4o when used as the value model. Experiments show that the approach improves performance by 20% while reducing costs by 37× compared to previous LLM-based tree search, pointing toward low-cost, reward-free optimization of LLM reasoning.
📝 Abstract
Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameter) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs by 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
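To make the core idea concrete, here is a minimal toy sketch of reward-free value self-improvement via one-step lookahead. Everything below (the integer-state environment, the tabular value "model", the step cost, the update rule) is a hypothetical illustration of the general bootstrapping principle, not the paper's actual implementation, which trains a neural value model on language states:

```python
def successors(state):
    """Toy deterministic environment: from integer state s, the two
    available actions lead to s+1 and s+2; state 10 is the goal."""
    return [min(state + 1, 10), min(state + 2, 10)]

def lookahead_update(values, states):
    """Self-supervised update: each non-terminal state's new value is
    the best current value among its successors (minus a small step
    cost), so value information propagates backward from the goal
    without any ground-truth reward for intermediate states."""
    new_values = dict(values)
    for s in states:
        if s == 10:  # terminal state: value is fixed
            continue
        new_values[s] = max(values[t] for t in successors(s)) - 0.1
    return new_values

# Initialize: only the goal state has a known value.
values = {s: 0.0 for s in range(11)}
values[10] = 1.0

# Iterate the lookahead self-training step until values converge.
for _ in range(10):
    values = lookahead_update(values, list(range(11)))

# Greedy search guided by the learned values reaches the goal
# in the minimum number of steps (always taking the +2 action).
state, steps = 0, 0
while state != 10 and steps < 20:
    state = max(successors(state), key=lambda t: values[t])
    steps += 1
print(state, steps)  # reaches state 10 in 5 steps
```

The tabular update here plays the role of the trained value model: in the paper's setting, the lookahead targets instead supervise a language-model-based value function over web states, and the resulting model guides tree search in place of an expensive frontier LLM.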