ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Non-reasoning language models (e.g., Llama 3) show limited capability on iterative reasoning tasks, such as mathematical problem solving, that require multi-step correction and refinement. Method: This paper proposes ASTRO, a reflection-and-backtracking framework that integrates Monte Carlo Tree Search (MCTS) with natural-language chain-of-thought (CoT) reasoning. Crucially, it formalizes exploration, failure detection, and backtracking, the core components of search algorithms, as learnable synthetic CoT trajectories, optimized via supervised fine-tuning followed by reinforcement learning with verifiable rewards. The approach requires no architectural modification to the base model. Contribution/Results: It endows off-the-shelf language models with self-reflection and path-correction capabilities. Evaluated on MATH-500, AMC 2023, and AIME 2024, the method achieves absolute accuracy gains of 16.0%, 26.9%, and 20.0%, respectively, significantly improving robustness and generalization on problems that require multi-step correction.
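The summary's key mechanism, turning search traces (including failed branches) into natural-language CoT with explicit reflection and backtracking, can be sketched as follows. This is an illustrative toy, not the paper's code; the function name, trace format, and phrasing templates are all assumptions.

```python
# Illustrative sketch (not the paper's implementation): linearize a search
# trajectory that contains a failed branch into a chain-of-thought string
# with explicit reflection and backtracking, in the spirit of ASTRO's
# synthetic-data construction. All names here are hypothetical.

def trace_to_cot(steps):
    """steps: list of (text, ok) pairs from a search trajectory.
    Failed steps are kept, followed by a reflection sentence and a
    backtrack marker, so a model fine-tuned on the result sees
    in-context recovery rather than only clean solutions."""
    lines = []
    for text, ok in steps:
        lines.append(text)
        if not ok:
            lines.append("Wait, this step looks wrong; let me reconsider.")
            lines.append("Backtracking to the previous correct step.")
    return "\n".join(lines)

trace = [
    ("Let x be the unknown; set up 2x + 3 = 11.", True),
    ("Subtracting 3 gives 2x = 14.", False),  # slip explored by the search
    ("Subtracting 3 gives 2x = 8, so x = 4.", True),
]
print(trace_to_cot(trace))
```

The point of keeping the failed step (rather than deleting it) is that the resulting trace teaches the recovery behavior itself, which the paper argues is the prior that makes later RL effective.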

📝 Abstract
We introduce ASTRO, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning in non-reasoner language models like Llama 3
Teaching structured search behavior via synthetic MCTS datasets
Improving iterative correction in math problem-solving tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Monte Carlo Tree Search for synthetic data
Converts search traces into natural language
Applies reinforcement learning with verifiable rewards
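The last bullet, RL with verifiable rewards, means the reward is a programmatic check of the final answer rather than a learned reward model. A minimal sketch, assuming an "Answer:" convention for the final line (the convention and function names are assumptions, not the paper's API):

```python
# Hypothetical sketch of a verifiable reward: the RL signal is simply
# whether the model's final answer matches the ground truth, with no
# learned reward model involved.

import re

def extract_answer(cot: str) -> str:
    """Pull the final answer out of a chain of thought.
    The 'Answer:' convention is an assumption for this sketch."""
    m = re.search(r"Answer:\s*(\S+)", cot)
    return m.group(1) if m else ""

def verifiable_reward(cot: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches gold, else 0.0."""
    return 1.0 if extract_answer(cot) == gold else 0.0

print(verifiable_reward("2x = 8, so x = 4. Answer: 4", "4"))   # 1.0
print(verifiable_reward("2x = 14, so x = 7. Answer: 7", "4"))  # 0.0
```

Because the reward is exact-match against a known answer, it cannot be gamed the way a learned reward model can, which is why math benchmarks with checkable answers suit this training recipe.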
Joongwon Kim
AI at Meta, University of Washington
Anirudh Goyal
Mila, Université de Montréal
Machine Learning · Deep Learning · Deep Reinforcement Learning
Liang Tan
AI at Meta
Hannaneh Hajishirzi
University of Washington; Allen AI
NLP · Language Models · AI
Srinivasan Iyer
AI at Meta
Tianlu Wang
AI at Meta