Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

📅 2026-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models often exhibit overconfident hallucinations in closed-book question answering because they fail to recognize the limits of their own knowledge. To address this, this work proposes a training- and retrieval-free abstention mechanism that treats output inconsistency across three task-equivalent prompting regimes—Direct, Assistive, and Incremental—as an internal signal of uncertainty. By reframing prompt decomposition as a reliability diagnostic rather than an accuracy booster, the method combines cross-regime consistency checking with unsupervised uncertainty estimation to identify questions the model cannot answer. Evaluated on multiple multi-hop question answering benchmarks, the approach outperforms conventional uncertainty baselines, achieving consistent improvements in both F1 and AUROC and thereby strengthening the model's capacity for error detection.

📝 Abstract
Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
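The disagreement-based abstention policy described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the loose answer normalization, and the unanimity threshold are all assumptions for the sketch.

```python
def normalize(answer: str) -> str:
    """Loosely canonicalize an answer: lowercase, drop punctuation,
    collapse surrounding whitespace (illustrative normalization)."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def should_abstain(answers: list[str]) -> bool:
    """Abstain when answers from the prompting regimes disagree.

    `answers` holds one answer per task-equivalent regime (e.g. Direct,
    Assistive, Incremental). The intuition from the paper: stable factual
    knowledge yields the same answer under every regime, while
    hallucinations are stochastic and drift across regimes.
    """
    normalized = {normalize(a) for a in answers}
    # Require unanimous agreement; a looser variant could accept a majority.
    return len(normalized) > 1
```

In practice each answer would come from a separate model call under one regime; here, `should_abstain(["Paris", "paris ", "Paris."])` returns `False` (agreement, answer), while `should_abstain(["Paris", "London", "Paris"])` returns `True` (disagreement, abstain).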
Problem

Research questions and friction points this paper is trying to address.

knowledge gaps
hallucination
model reliability
closed-book QA
uncertainty detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

decomposed prompting
abstention policy
hallucination detection
uncertainty estimation
closed-book QA