Confidence Estimation for LLMs in Multi-turn Interactions

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of reliable confidence estimation mechanisms in current large language models during multi-turn dialogues, a gap that hinders their deployment in trustworthy human-AI collaborative systems because confidence must track dynamic context accumulation and progressive ambiguity resolution. To this end, we propose the first confidence evaluation framework tailored to multi-turn interactions, grounded in two principles: turn-wise calibration and monotonicity of confidence with respect to added information. We construct a controllable "Hinter-Guesser" benchmark dataset, introduce InfoECE, a length-normalized evaluation metric, and present a logit-based P(Sufficient) probing method. Experimental results demonstrate that existing confidence estimation approaches exhibit insufficient calibration and monotonicity in multi-turn settings, whereas our probe significantly outperforms them, establishing a new foundation for building trustworthy dialogue systems.
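The two desiderata named in the summary, turn-wise calibration and monotonicity under accumulating information, can be made concrete with standard measures. The sketch below computes a plain binned Expected Calibration Error and a simple per-dialogue monotonicity-violation rate. This is a minimal illustration under assumed definitions, not the paper's InfoECE (whose length normalization is not detailed here); all function names are our own.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: group predictions by stated confidence, then
    average |empirical accuracy - mean confidence| per bin, weighted by
    the fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

def monotonicity_violations(turn_confidences):
    """Fraction of adjacent turn pairs where confidence drops even though
    information only accumulates across turns (lower is better)."""
    c = np.asarray(turn_confidences, dtype=float)
    diffs = np.diff(c)
    return float((diffs < 0).mean()) if diffs.size else 0.0

# A perfectly calibrated model (50% confidence, 50% accuracy) has ECE 0;
# one confidence drop in three turn transitions gives a 1/3 violation rate.
ece = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
viol = monotonicity_violations([0.2, 0.5, 0.4, 0.9])
```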

πŸ“ Abstract
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research predominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
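The abstract describes P(Sufficient) only as a logit-based probe. One common realization of such probes is a linear logistic classifier trained on a model's hidden-state vectors to predict whether the accumulated context suffices to answer. The sketch below trains such a probe on synthetic stand-in features; the data, shapes, and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-turn representations: in a real setting these would
# be the LLM's hidden states at each dialogue turn, labeled with whether
# the context so far was sufficient to answer correctly (assumed setup).
d = 16
w_true = rng.normal(size=d)
X = rng.normal(size=(500, d))                              # "hidden states"
y = (X @ w_true + rng.normal(scale=0.5, size=500) > 0).astype(float)

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a linear logit-based probe with plain logistic-regression
    gradient descent on the binary sufficiency labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))             # sufficiency score
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

w, b = train_probe(X, y)
p_sufficient = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # in [0, 1]
accuracy = float(((p_sufficient > 0.5) == y).mean())
```

Because the probe's output is a probability, it can be fed directly into calibration metrics such as ECE, which is what makes probe-style estimators attractive for the per-turn evaluation the paper proposes.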
Problem

Research questions and friction points this paper is trying to address.

confidence estimation
multi-turn interactions
Large Language Models
calibration
hallucination mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation
multi-turn interactions
calibration
monotonicity
large language models