🤖 AI Summary
Non-reasoning language models (e.g., Llama 3) struggle with iterative reasoning tasks, such as mathematical problem solving, that require multi-step correction and refinement.
Method: This paper proposes a reflection-and-backtracking enhancement framework that integrates Monte Carlo Tree Search (MCTS) with natural-language chain-of-thought (CoT) reasoning. Crucially, it formalizes exploration, failure detection, and backtracking (the core operations of a search algorithm) as learnable synthetic CoT trajectories, optimized first via supervised fine-tuning and then via reinforcement learning with verifiable rewards. The approach requires no architectural modification to the base model.
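The core data-construction step, turning a search tree into a natural-language trace that demonstrates a failed attempt followed by a recovery, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Node` structure, the verifier labels, and the backtracking phrasing are all hypothetical stand-ins for what ASTRO derives from MCTS rollouts.

```python
# Minimal sketch (all names hypothetical): linearize a search trace containing
# a failed branch and a successful branch into a chain of thought with explicit
# reflection and backtracking, in the spirit of ASTRO's search-derived traces.

from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                      # one reasoning step in natural language
    correct: bool                  # verifier label for this branch
    children: list = field(default_factory=list)

def linearize(node, lines=None):
    """Depth-first walk; failed branches get reflection + backtracking text."""
    if lines is None:
        lines = []
    lines.append(node.step)
    # Visit failed children first so the trace demonstrates recovery from failure.
    for child in sorted(node.children, key=lambda c: c.correct):
        linearize(child, lines)
        if not child.correct:
            lines.append("Wait, this approach leads to a contradiction. "
                         "Let me backtrack and try a different step.")
    return lines

# Toy trace: one dead end, then the correct continuation.
root = Node("We need to solve x^2 - 5x + 6 = 0.", True, [
    Node("Try factoring as (x - 1)(x - 6); expanding gives x^2 - 7x + 6, "
         "which does not match.", False),
    Node("Factor as (x - 2)(x - 3) = 0, so x = 2 or x = 3.", True),
])

cot = "\n".join(linearize(root))
print(cot)
```

Traces like this are then used as supervised fine-tuning targets, so the model learns to emit the reflection and backtracking moves itself rather than relying on an external search loop at inference time.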
Contribution/Results: It endows off-the-shelf language models with self-reflection and path-correction capabilities. Evaluated on MATH-500, AMC 2023, and AIME 2024, the method achieves absolute accuracy gains of +16.0%, +26.9%, and +20.0%, respectively, significantly improving robustness and generalization on multi-step corrective reasoning problems.
📝 Abstract
We introduce ASTRO, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.