🤖 AI Summary
This work addresses the limitation of existing confidence calibration methods for large language models, which are typically confined to single-turn interactions and fail to account for the dynamic shifts in confidence and user feedback inherent in multi-turn dialogues. To bridge this gap, the study formulates confidence calibration as a dynamic multi-turn task and introduces ECE@T, a novel metric for evaluating calibration reliability across dialogue turns. The authors propose MTCal, a method that leverages dialogue history to refine per-turn calibration, alongside ConfChat, an adaptive decoding strategy that enhances factual consistency without compromising model performance. Experimental results demonstrate that MTCal achieves consistently superior calibration in multi-turn settings, and the overall framework significantly improves the trustworthiness of model responses throughout extended conversations.
📝 Abstract
Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration, reframing calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where the model's confidence must be calibrated at each turn, conditioned on the conversation history. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both the factuality and consistency of model responses in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves strong and consistent performance in multi-turn calibration, and that ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as a missing link for scaling LLM calibration toward safe, reliable, real-world use.
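To make the metric concrete: ECE@T applies the standard Expected Calibration Error, restricted to the responses a model produces at a given dialogue turn T. The sketch below is an illustrative assumption about how such a per-turn ECE could be computed with equal-width confidence bins; the field names (`turn`, `confidence`, `correct`) and the binning scheme are hypothetical, and the paper's actual MTCal training objective and ConfChat decoding are not reproduced here.

```python
def ece(confidences, correctness, n_bins=10):
    """Standard ECE: accuracy-vs-confidence gap, weighted by bin size."""
    n = len(confidences)
    if n == 0:
        return 0.0
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each sample to a half-open bin (lo, hi]; put 0.0 in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correctness[i] for i in idx) / len(idx)
        total += (len(idx) / n) * abs(avg_conf - accuracy)
    return total

def ece_at_turn(records, turn, n_bins=10):
    """ECE@T: ECE computed only over responses produced at dialogue turn `turn`."""
    confs = [r["confidence"] for r in records if r["turn"] == turn]
    corr = [r["correct"] for r in records if r["turn"] == turn]
    return ece(confs, corr, n_bins)
```

Tracking `ece_at_turn` as T grows is what exposes the degradation the abstract describes: if user pushback at later turns inflates or deflates confidence without changing accuracy, ECE@T rises with T.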