🤖 AI Summary
To address the scarcity and high cost of ground-truth reward annotations and expert demonstrations in multi-step reasoning tasks (e.g., web navigation), this paper proposes *Self-Taught Lookahead*, a self-supervised method that constructs training signals from environment state transitions, enabling a lightweight state-value model to be trained without ground-truth rewards and used to guide tree search by language models. Its core contributions are a reward-free self-improvement mechanism for value modeling and an empirical demonstration that an open-weight 8B-parameter value model, improved with this method, can match a frontier LLM such as GPT-4o when used as the value model. Experiments show that the approach improves performance by 20% while reducing costs by 37× compared to previous LLM-based tree search, pointing toward low-cost, reward-free optimization of LLM reasoning.
📝 Abstract
Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameter) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs by 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
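To make the core idea concrete, here is a minimal toy sketch of reward-free value self-improvement via one-step lookahead. Everything below (the integer-state environment, the tabular value "model", the step cost, the update rule) is a hypothetical illustration of the general bootstrapping principle, not the paper's actual implementation, which trains a neural value model on language states:

```python
def successors(state):
    """Toy deterministic environment: from integer state s, the two
    available actions lead to s+1 and s+2; state 10 is the goal."""
    return [min(state + 1, 10), min(state + 2, 10)]

def lookahead_update(values, states):
    """Self-supervised update: each non-terminal state's new value is
    the best current value among its successors (minus a small step
    cost), so value information propagates backward from the goal
    without any ground-truth reward for intermediate states."""
    new_values = dict(values)
    for s in states:
        if s == 10:  # terminal state: value is fixed
            continue
        new_values[s] = max(values[t] for t in successors(s)) - 0.1
    return new_values

# Initialize: only the goal state has a known value.
values = {s: 0.0 for s in range(11)}
values[10] = 1.0

# Iterate the lookahead self-training step until values converge.
for _ in range(10):
    values = lookahead_update(values, list(range(11)))

# Greedy search guided by the learned values reaches the goal
# in the minimum number of steps (always taking the +2 action).
state, steps = 0, 0
while state != 10 and steps < 20:
    state = max(successors(state), key=lambda t: values[t])
    steps += 1
print(state, steps)  # reaches state 10 in 5 steps
```

The tabular update here plays the role of the trained value model: in the paper's setting, the lookahead targets instead supervise a language-model-based value function over web states, and the resulting model guides tree search in place of an expensive frontier LLM.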