🤖 AI Summary
Large language models (LLMs) struggle to reliably detect errors in multi-step reasoning tasks, where evaluating only the final output often misses intermediate mistakes. Method: Self-evaluation techniques are extended from single-step outputs to multi-step reasoning chains, comparing two intuitive approaches: holistic scoring of the entire chain and step-by-step scoring of each intermediate step. Contribution/Results: The core insight is that estimating confidence at each reasoning step enables earlier and more accurate identification of intermediate errors than a single holistic judgment. Evaluated on two mainstream multi-step reasoning benchmarks, step-by-step scoring generally outperforms holistic scoring, achieving up to a 15% relative improvement in error detection performance (AUC-ROC) and enhancing the trustworthiness and controllability of LLMs in complex reasoning scenarios.
📝 Abstract
Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.