BFS-PO: Best-First Search for Large Reasoning Models

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and “overthinking” issues prevalent in large reasoning models when tackling complex tasks, particularly exacerbated during reinforcement learning (RL) training. To mitigate these challenges, the authors propose BFS-PO, a novel algorithm that integrates best-first search (BFS) with RL and introduces a maximum-entropy-based backtracking mechanism. This approach dynamically prunes redundant reasoning steps during training, guiding the model to learn concise yet correct inference paths. Experimental results demonstrate that BFS-PO consistently improves reasoning accuracy while significantly reducing output length across multiple benchmarks and diverse base models, achieving efficient and precise reasoning optimization.

📝 Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks by using long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
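The abstract's core idea, best-first exploration with backtracking from maximum-entropy nodes, can be sketched in miniature. The snippet below is a hedged illustration, not the authors' implementation: the toy reasoning tree, the `bfs_po_search` function, and the per-node entropy values are all hypothetical, standing in for the model's actual rollout states and token-level uncertainties.

```python
import heapq

# Hypothetical toy reasoning tree: node -> (entropy, is_correct_leaf, children).
# In the paper's setting, nodes would be partial reasoning chains and entropy
# would come from the LRM's token distribution; these values are made up.
TREE = {
    "root": (1.0, False, ["a", "b"]),
    "a":    (0.9, False, ["a1", "a2"]),
    "b":    (0.3, False, ["b1"]),
    "a1":   (0.2, True,  []),   # a correct, short answer
    "a2":   (0.8, False, []),   # a dead end
    "b1":   (0.1, True,  []),
}

def bfs_po_search(root="root"):
    """Best-first search sketch: always resume from the frontier node with
    maximum entropy, and stop at the first correct leaf reached."""
    # heapq is a min-heap, so negate entropy to pop the max-entropy node first.
    frontier = [(-TREE[root][0], [root])]
    while frontier:
        _neg_entropy, path = heapq.heappop(frontier)
        entropy, correct, children = TREE[path[-1]]
        if correct:
            return path  # first correct answer found under this search order
        for child in children:
            # Backtracking is implicit: dead ends simply add no children,
            # and the search resumes from the best remaining frontier node.
            heapq.heappush(frontier, (-TREE[child][0], path + [child]))
    return None

print(bfs_po_search())
```

In this toy run the search expands the high-entropy branch first, abandons the dead end `a2` without further cost, and returns the short correct path through `a1`, mirroring how the described mechanism would prune redundant reasoning during training.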
Problem

Research questions and friction points this paper is trying to address.

Large Reasoning Models
overthinking
computational cost
verbose output
Reinforcement Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-First Search
Large Reasoning Models
overthinking
reinforcement learning
maximum entropy