🤖 AI Summary
Existing LLM confidence estimation methods neglect the relevance between model responses and the input context, leading to inaccurate trustworthiness assessments in knowledge-augmented settings. To address this, we propose CRUX, the first framework to jointly model context fidelity and global consistency. Context fidelity is quantified via context-aware entropy reduction under contrastive sampling; global consistency is enforced through cross-contextual validation. Together they form a context-aware dual-metric mechanism that disentangles data uncertainty from model uncertainty. CRUX integrates contrastive sampling, conditional entropy computation, and cross-contextual consistency modeling. Evaluated on five benchmark datasets (CoQA, SQuAD, QuAC, BioASQ, and EduQG), it achieves significantly higher AUROC than state-of-the-art baselines, demonstrating both effectiveness and generalizability for confidence calibration in complex question-answering tasks.
📝 Abstract
Accurate confidence estimation is essential for trustworthy large language model (LLM) systems: it lets users decide when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in evaluating output quality, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty as the information gain obtained through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of answers generated with and without context. Experiments on three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, consistently achieving higher AUROC than existing baselines.
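The contextual entropy reduction metric can be illustrated with a minimal sketch: sample answers from the model both with and without the retrieved context, estimate the entropy of each empirical answer distribution, and treat their difference as the information gain the context provides. This is a simplified illustration, not the paper's exact formulation; the function names and the toy answer lists are hypothetical, and a real implementation would sample from an actual LLM and likely cluster semantically equivalent answers before computing entropy.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def contextual_entropy_reduction(samples_with_ctx, samples_without_ctx):
    """Information gain: how much providing the context reduces answer uncertainty.

    A large positive value suggests the model's answer is grounded in the
    context (lower data uncertainty); a value near zero suggests the context
    did not change what the model would have said anyway.
    """
    return entropy(samples_without_ctx) - entropy(samples_with_ctx)

# Toy example (hypothetical data): without context the model spreads over
# several answers; with context it converges on a single answer.
no_ctx = ["Paris", "Lyon", "Paris", "Marseille"]
with_ctx = ["Paris", "Paris", "Paris", "Paris"]
gain = contextual_entropy_reduction(with_ctx, no_ctx)  # positive gain
```

A higher `gain` would then feed into the confidence score alongside the consistency term, which checks whether the with-context and without-context answer sets agree globally.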