🤖 AI Summary
Current large language models (LLMs) face two fundamental bottlenecks on the path to Level 2 AGI Reasoners: (1) inefficient "overthinking" during inference, and (2) overreliance on auxiliary reward models to correct reasoning paths. Both stem from an inability to internalize the search process, in particular the decision of when and where to backtrack. To address this, we propose **Self-Backtracking**, a mechanism that equips LLMs with the ability to autonomously determine *when* and *where* to backtrack within a reasoning trace, during both training and inference, and to convert slow, deliberative reasoning into efficient fast thinking through self-improvement. Empirical evaluations on reasoning benchmarks show an accuracy gain of over 40% relative to optimal-path supervised fine-tuning, together with improved generalization, robustness, and inference efficiency, all without auxiliary reward models.
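To make the when-and-where decision concrete, below is a minimal sketch of what a self-backtracking decoding loop could look like. Everything here is illustrative rather than taken from the paper: `BACKTRACK` stands for a hypothetical special token the model is trained to emit when it judges its latest step a dead end, and `propose_step` stands in for one decoding call.

```python
import random
from typing import Callable, List

BACKTRACK = "<backtrack>"  # hypothetical token the model learns to emit
EOS = "<eos>"              # hypothetical end-of-solution token

# One decoding call: given the partial reasoning trace, the model returns
# either the next step, BACKTRACK, or EOS.
ProposeStep = Callable[[List[str]], str]

def self_backtracking_decode(
    propose_step: ProposeStep,
    max_steps: int = 32,
    max_backtracks: int = 8,
) -> List[str]:
    """Decoding loop in which the model itself decides when to backtrack:
    emitting BACKTRACK pops the most recent step, and decoding resumes
    from the shortened trace, within a fixed backtrack budget."""
    trace: List[str] = []
    backtracks = 0
    for _ in range(max_steps):
        token = propose_step(trace)
        if token == EOS:
            break
        if token == BACKTRACK:
            if trace and backtracks < max_backtracks:
                trace.pop()  # undo the step the model judged to be wrong
                backtracks += 1
            # if the budget is spent, the backtrack request is ignored
            continue
        trace.append(token)
    return trace

# Toy stand-in for a trained model: it treats step "3" as a dead end and
# retracts it, otherwise extends the trace until four steps survive.
def toy_model(trace: List[str]) -> str:
    if trace and trace[-1] == "3":
        return BACKTRACK
    if len(trace) >= 4:
        return EOS
    return random.choice(["1", "2", "3", "4"])

if __name__ == "__main__":
    random.seed(0)
    print(self_backtracking_decode(toy_model))  # prints the surviving trace
```

A real implementation would decode with an actual LM and could track multiple candidate prefixes, but the control flow, a learned backtrack signal plus a budget, is the part the mechanism asks the model to internalize rather than delegate to an external reward model.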
📝 Abstract
The integration of slow-thinking mechanisms into large language models (LLMs) offers a promising path toward achieving Level 2 AGI Reasoners, as exemplified by systems like OpenAI's o1. However, several significant challenges remain, including inefficient overthinking and an overreliance on auxiliary reward models. We point out that these limitations stem from LLMs' inability to internalize the search process, a key component of effective reasoning. A critical step toward addressing this issue is enabling LLMs to autonomously determine when and where to backtrack, a fundamental operation in traditional search algorithms. To this end, we propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism not only enhances reasoning ability but also improves efficiency by transforming slow-thinking processes into fast thinking through self-improvement. Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40% compared to the optimal-path supervised fine-tuning method. We believe this study introduces a novel and promising pathway for developing more advanced and robust Reasoners.
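The abstract's slow-to-fast transfer reads like an expert-iteration-style recipe: run the backtracking search, keep the problems it solves correctly, strip the retracted detours out of each trace, and fine-tune on the resulting direct solutions. The sketch below follows that reading and is an assumption, not the paper's documented pipeline; it takes the raw token stream (`BACKTRACK` tokens included) produced by a slow solver such as the loop sketched above.

```python
from typing import Callable, List, Tuple

BACKTRACK = "<backtrack>"  # hypothetical token marking a retracted step

def linearize(trace: List[str]) -> List[str]:
    """Replay a raw trace and keep only the surviving steps: each BACKTRACK
    deletes the most recent kept step, yielding the direct fast-thinking
    path with the detours edited out."""
    kept: List[str] = []
    for token in trace:
        if token == BACKTRACK:
            if kept:
                kept.pop()
        else:
            kept.append(token)
    return kept

def build_sft_corpus(
    problems: List[str],
    solve_slow: Callable[[str], List[str]],       # raw tokens, BACKTRACK included
    is_correct: Callable[[str, List[str]], bool],
) -> List[Tuple[str, List[str]]]:
    """Assumed self-improvement step: solve slowly, verify, and pair each
    solved problem with its linearized trace as a fine-tuning target, so
    the next model iteration reaches the answer without the detours."""
    corpus: List[Tuple[str, List[str]]] = []
    for problem in problems:
        raw = solve_slow(problem)
        direct = linearize(raw)
        if is_correct(problem, direct):
            corpus.append((problem, direct))
    return corpus

if __name__ == "__main__":
    # "3" was proposed and then retracted, so it is absent from the target.
    print(linearize(["1", "2", "3", BACKTRACK, "4"]))  # ['1', '2', '4']
```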