🤖 AI Summary
This work addresses the high computational cost and “overthinking” that large reasoning models exhibit on complex tasks, problems that are often exacerbated by reinforcement learning (RL) training. To mitigate them, the authors propose BFS-PO, a novel algorithm that integrates best-first search (BFS) into RL and introduces a backtracking mechanism based on maximum-entropy nodes. This approach dynamically prunes redundant reasoning steps during training, guiding the model toward concise yet correct inference paths. Experiments show that BFS-PO consistently improves reasoning accuracy while significantly reducing output length across multiple benchmarks and diverse base models.
📝 Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks by generating long reasoning chains. However, this has also led to a significant increase in computational cost and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem with a Best-First Search exploration strategy. Specifically, BFS-PO searches for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, the model learns to produce concise reasoning chains. Across different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
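The abstract's core idea, a best-first search for the shortest correct answer that backtracks at maximum-entropy nodes, can be illustrated with a toy sketch. This is not the paper's implementation: the `expand` and `is_correct` callables, the priority key, and the step budget are all assumptions made for illustration; the real method operates over token-level rollouts of an LRM during RL training.

```python
import heapq
import math

def entropy(probs):
    """Shannon entropy of a next-step distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bfs_shortest_correct(expand, is_correct, root, max_steps=1000):
    """Toy best-first search for the shortest correct reasoning chain.

    expand(chain)     -> list of (step, next_step_probs) continuations.
    is_correct(chain) -> True if the chain ends in a correct answer.

    The frontier is ordered by (length, -entropy): shorter chains are
    explored first, and ties are broken by backtracking to the most
    uncertain (maximum-entropy) open node, mirroring the abstract's
    entropy-based backtracking heuristic.
    """
    counter = 0  # insertion order, also serves as a step budget
    frontier = [(len(root), 0.0, counter, root)]
    while frontier and counter < max_steps:
        _, _, _, chain = heapq.heappop(frontier)
        if is_correct(chain):
            return chain  # first hit is a shortest explored correct chain
        for step, probs in expand(chain):
            counter += 1
            heapq.heappush(
                frontier,
                (len(chain) + 1, -entropy(probs), counter, chain + [step]),
            )
    return None
```

As a usage example, searching for the shortest sequence of steps from {1, 2, 3} that sums to 4 returns a two-step chain, since the length-first ordering guarantees no longer chain is accepted while a shorter correct one exists.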