🤖 AI Summary
Large language models (LLMs) struggle to reliably detect errors in multi-step reasoning tasks, where evaluating only the final output often misses intermediate mistakes. Method: Self-evaluation techniques are extended from single-step outputs to multi-step reasoning chains, comparing two intuitive approaches: holistic scoring of the entire chain and step-by-step scoring of each intermediate step. Contribution/Results: The core insight is that estimating confidence at each reasoning step enables earlier and more accurate identification of intermediate errors than a single holistic judgment. Evaluated on two mainstream multi-step reasoning benchmarks, step-by-step scoring generally outperforms holistic scoring, achieving up to a 15% relative improvement in error detection performance (AUC-ROC) and enhancing the trustworthiness and controllability of LLMs in complex reasoning scenarios.
📝 Abstract
Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.