🤖 AI Summary
This work addresses the unreliability of process reward models (PRMs) when scoring out-of-distribution (OOD) reasoning paths encountered during external search with large language models. To mitigate this issue, the authors propose an uncertainty-aware tree search method that, for the first time, incorporates epistemic uncertainty modeling into reasoning search. Specifically, Monte Carlo Dropout is employed to estimate the uncertainty of PRM predictions on OOD samples, and a reinforcement learning controller dynamically allocates computational resources based on this uncertainty. Theoretical analysis establishes that the proposed strategy achieves a sublinear regret bound, while empirical results demonstrate significant improvements in both accuracy and robustness on complex reasoning tasks.
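The Monte Carlo Dropout estimate mentioned above can be sketched in a few lines. This is a minimal illustration only: the random two-layer scorer and the embedding dimensions below are hypothetical stand-ins for the paper's LLM-based PRM, and the key idea is simply that dropout stays active at inference time, so the spread over stochastic forward passes serves as an epistemic-uncertainty proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights standing in for a PRM scoring head (hypothetical; the
# paper's PRM is an LLM-based model, not this two-layer MLP).
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prm_score(x, dropout_p=0.2):
    """One stochastic forward pass with a fresh dropout mask."""
    h = np.maximum(x @ W1, 0.0)                # ReLU hidden layer
    mask = rng.random(h.shape) > dropout_p     # Bernoulli keep-mask
    h = h * mask / (1.0 - dropout_p)           # inverted dropout, kept ON at inference
    return sigmoid(h @ W2).squeeze(-1)         # step score in [0, 1]

def mc_dropout(x, n_samples=64):
    """Mean PRM score and an epistemic-uncertainty proxy (std over passes)."""
    samples = np.stack([prm_score(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.normal(size=(4, 16))   # embeddings of 4 candidate reasoning steps
mean, std = mc_dropout(x)      # per-step score and uncertainty estimates
```

In a search loop, `std` would flag candidate steps on which the PRM's score should not be trusted, e.g. steps far from its training distribution.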
📝 Abstract
Inference-time reasoning scaling has significantly advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). However, a fundamental limitation of this framework is the epistemic uncertainty of PRMs when evaluating reasoning paths that deviate from their training distribution. In this work, we conduct a systematic analysis of this challenge. We first provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret. Motivated by these findings, we propose Uncertainty-Aware Tree Search (UATS), a unified method that estimates uncertainty via Monte Carlo Dropout and dynamically allocates compute budget using a reinforcement learning-based controller. Extensive experiments demonstrate that our approach effectively mitigates the impact of OOD errors.
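The abstract's compute-allocation idea, i.e. spending more search budget where the PRM is least certain, can be illustrated with a simple softmax-weighted split. Note this is a hand-written heuristic standing in for the paper's learned reinforcement-learning controller; the function name, temperature parameter, and budget numbers are all assumptions for illustration.

```python
import numpy as np

def allocate_budget(uncertainties, total_budget=64, temperature=1.0):
    """Split a rollout budget across frontier nodes, giving more samples
    to nodes whose PRM score is more uncertain (softmax weighting).

    Heuristic sketch only: the paper uses a learned RL controller here.
    """
    u = np.asarray(uncertainties, dtype=float) / temperature
    w = np.exp(u - u.max())          # numerically stable softmax weights
    w /= w.sum()
    alloc = np.floor(w * total_budget).astype(int)
    alloc[np.argmax(w)] += total_budget - alloc.sum()  # remainder to top node
    return alloc

# Four frontier nodes; the second has the most uncertain PRM score.
alloc = allocate_budget([0.02, 0.30, 0.10, 0.05], total_budget=32)
```

Higher temperature flattens the split toward uniform sampling; lower temperature concentrates nearly all rollouts on the most uncertain node.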