Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

📅 2025-11-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Large language models (LLMs) struggle to reliably detect errors in multi-step reasoning tasks, especially where evaluation designed for single-step outputs falls short. Method: We propose a self-assessment framework that applies confidence estimation step by step within the reasoning process, enabling fine-grained per-step scoring instead of conventional holistic scoring. Contribution/Results: This is the first work to systematically extend self-assessment mechanisms to multi-step reasoning chains. Our core insight is that modeling confidence at each reasoning step enables earlier and more accurate identification of intermediate errors. Evaluated on two multi-step reasoning benchmarks, our method achieves up to a 15% relative improvement in error detection performance (AUC-ROC), improving the trustworthiness and controllability of LLMs in complex reasoning scenarios.

📝 Abstract
Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
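
To make the reported metric concrete, here is a minimal sketch of how AUC-ROC for failure detection and a "relative increase" between two scorers could be computed. The function names, the toy scores, and the 0.60 and 0.69 AUC values are illustrative assumptions, not numbers or code from the paper.

```python
def auc_roc(error_scores, is_failure):
    """AUC-ROC as the probability that a randomly chosen failed response
    (label 1) receives a higher error score than a randomly chosen correct
    one (label 0). Ties count as half a win."""
    failed = [s for s, y in zip(error_scores, is_failure) if y == 1]
    correct = [s for s, y in zip(error_scores, is_failure) if y == 0]
    if not failed or not correct:
        raise ValueError("need at least one failed and one correct response")
    wins = 0.0
    for f in failed:
        for c in correct:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / (len(failed) * len(correct))


def relative_increase(baseline, improved):
    """Relative (not absolute) change of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Toy data (hypothetical): estimated error likelihood per response,
# and whether the response actually failed.
toy_scores = [0.9, 0.7, 0.6, 0.4]
toy_labels = [1, 0, 1, 0]
print(auc_roc(toy_scores, toy_labels))  # 0.75 for this toy data

# Illustrative AUC values only: going from 0.60 to 0.69 is a 15% relative
# increase, although the absolute gain is 0.09.
print(round(relative_increase(0.60, 0.69), 2))  # 0.15
```

The distinction matters when reading the headline number: a 15% relative increase is measured against the baseline AUC-ROC, so the absolute gain depends on how strong the holistic baseline already is.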
Problem

Research questions and friction points this paper is trying to address.

Extending self-evaluation techniques to multi-step reasoning tasks
Detecting potential errors in LLM responses through confidence estimation
Improving failure detection reliability for complex multi-step tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stepwise confidence estimation for multi-step tasks
Comparing holistic versus step-by-step scoring methods
Self-evaluating LLMs detect errors in complex reasoning
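
The contrast between holistic and step-by-step scoring listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the `Scorer` interface, the choice to judge each step together with its preceding steps, the minimum-based aggregation, the 0.5 threshold, and the `toy_scorer` stand-in for an LLM judge are all hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical scorer interface: (task, text to judge) -> confidence in [0, 1]
# that the text is correct. In practice this would wrap an LLM call.
Scorer = Callable[[str, str], float]


def holistic_confidence(task: str, steps: List[str], scorer: Scorer) -> float:
    """Score the whole response once."""
    return scorer(task, "\n".join(steps))


def stepwise_confidences(task: str, steps: List[str], scorer: Scorer) -> List[float]:
    """Score each step, passing the preceding steps along as context."""
    return [scorer(task, "\n".join(steps[: i + 1])) for i in range(len(steps))]


def aggregate_min(confidences: List[float]) -> float:
    """Assumed aggregation: a chain is only as reliable as its weakest step."""
    return min(confidences)


def first_flagged_step(confidences: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step whose confidence falls below the threshold."""
    for i, c in enumerate(confidences):
        if c < threshold:
            return i
    return None


def toy_scorer(task: str, text: str) -> float:
    """Toy stand-in for an LLM judge: it only inspects the last line.
    This artificial limitation exists purely to make the example runnable."""
    last_line = text.split("\n")[-1]
    return 0.2 if last_line == "7 * 6 = 48" else 0.9


task = "Compute 7 * 6 + 2."
steps = ["7 * 6 = 48", "48 + 2 = 50", "Answer: 50"]

per_step = stepwise_confidences(task, steps, toy_scorer)
print(per_step)                      # [0.2, 0.9, 0.9]
print(first_flagged_step(per_step))  # 0
print(aggregate_min(per_step))       # 0.2
print(holistic_confidence(task, steps, toy_scorer))  # 0.9
```

With this toy scorer the stepwise pass flags the faulty first step while the single holistic call does not, but that outcome is built into the stub. It shows the shape of the two pipelines, not evidence about how real scorers behave.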