🤖 AI Summary
This work investigates whether large language models can internalize and generalize complex search strategies within an unknown tree-structured solution space, using only bandit feedback and no prior knowledge of the tree. To this end, the authors design a simplified yet representative framework for blind tree search, training Transformer models from scratch to learn diverse search behaviors and employing trajectory-based fine-tuning to elicit latent search capabilities in pretrained large language models. Experimental results demonstrate that the proposed approach enables models to effectively balance exploration and exploitation across extended planning horizons and deeper tree structures, without relying on external search modules, and to achieve near-optimal performance. This study provides the first evidence that the Transformer architecture is capable of learning and generalizing sophisticated search strategies entirely from scratch.
📝 Abstract
Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, a search algorithm can navigate this space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: Can LLMs, or their underlying Transformer architectures, approximate a search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models show signs of generalizing to unseen conditions such as longer horizons and deeper trees. Furthermore, we demonstrate that continued task-focused training, in the form of fine-tuning on search trajectories, unlocks the full search capabilities of a pretrained LLM.
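To make the setting concrete, here is a minimal sketch of what "unknown tree search with bandit feedback" could look like as an environment. This is an illustrative reconstruction, not the paper's actual implementation: the `HiddenTree` environment, its latent node values, the noise model, and the UCB1-style baseline searcher are all our own assumptions. The key constraint it captures is that the agent never observes the tree; it can only request that a known node be expanded and receives a noisy scalar reward for the resulting child.

```python
import math
import random


class HiddenTree:
    """Hypothetical environment for unknown tree search with bandit feedback.

    Nodes are tuples of child indices (the root is ()). The agent cannot
    inspect the tree; expand() is the only interface, and it returns a
    noisy reward for the chosen child (the bandit feedback).
    """

    def __init__(self, depth, branching, seed=0):
        self.rng = random.Random(seed)
        self.depth = depth
        self.branching = branching
        self.values = {(): 0.5}  # node -> latent value in [0, 1]

    def _value(self, node):
        # Lazily sample a latent value that drifts around the parent's.
        if node not in self.values:
            parent = self.values[node[:-1]]
            drift = self.rng.uniform(-0.3, 0.3)
            self.values[node] = min(1.0, max(0.0, parent + drift))
        return self.values[node]

    def expand(self, node, child):
        """Expand `child` of `node`; return (child_node, noisy reward)."""
        assert len(node) < self.depth, "cannot expand a leaf"
        child_node = node + (child,)
        reward = self._value(child_node) + self.rng.gauss(0.0, 0.05)
        return child_node, reward


def ucb_search(tree, budget, c=0.7):
    """UCB1-style explorer over the hidden tree (an illustrative baseline,
    not the strategy the paper's models learn). Repeatedly descends from
    the root, picking the child with the highest upper confidence bound,
    and tracks the best complete path found."""
    counts, sums = {}, {}
    best_leaf, best_return = None, float("-inf")
    for _ in range(budget):
        node, total = (), 0.0
        while len(node) < tree.depth:
            def ucb(a, parent=node):
                child = parent + (a,)
                n = counts.get(child, 0)
                if n == 0:
                    return float("inf")  # force each child to be tried once
                siblings = sum(counts.get(parent + (b,), 0)
                               for b in range(tree.branching))
                return sums[child] / n + c * math.sqrt(math.log(siblings) / n)
            action = max(range(tree.branching), key=ucb)
            node, r = tree.expand(node, action)
            counts[node] = counts.get(node, 0) + 1
            sums[node] = sums.get(node, 0.0) + r
            total += r
        if total > best_return:
            best_return, best_leaf = total, node
    return best_leaf, best_return
```

A Transformer in the paper's setting would consume the growing trajectory of (node, reward) pairs as a sequence and emit the next expansion, so a loop like the one above doubles as a generator of training trajectories.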