A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

📅 2025-08-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM confidence estimation methods neglect the relevance between model responses and input context, leading to inaccurate trustworthiness assessment in knowledge-augmented settings. To address this, we propose CRUX, the first framework that jointly models context fidelity—quantified via context-aware entropy reduction under contrastive sampling—and global consistency—enforced through cross-contextual validation—thereby establishing a context-aware dual-metric mechanism that effectively disentangles data uncertainty from model uncertainty. CRUX integrates contrastive sampling, conditional entropy computation, and cross-contextual consistency modeling. Evaluated on five benchmark datasets—CoQA, SQuAD, QuAC, BioASQ, and EduQG—it achieves significantly higher AUROC than state-of-the-art baselines, demonstrating both effectiveness and generalizability in improving confidence calibration for complex question-answering tasks.

📝 Abstract
Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers users to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty as the information gain obtained through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the answers generated with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving higher AUROC than all existing baselines.
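The first metric, contextual entropy reduction, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes answers are compared by exact string match (the paper would more plausibly use semantic equivalence), and the function names are hypothetical. The idea is simply that conditioning on context should shrink the entropy of the sampled answer distribution, and the size of that shrinkage is the information gain.

```python
from collections import Counter
import math

def answer_entropy(answers):
    # Shannon entropy (in nats) of the empirical distribution
    # over sampled answers, treating exact strings as outcomes.
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def contextual_entropy_reduction(answers_with_ctx, answers_without_ctx):
    # Information gain from conditioning on the context:
    # H(answers | no context) - H(answers | context).
    # Large positive values suggest the context resolved data uncertainty.
    return answer_entropy(answers_without_ctx) - answer_entropy(answers_with_ctx)

# Toy example: without context the model is uncertain; with context
# the answer distribution collapses to a single answer.
no_ctx = ["Paris", "Lyon", "Paris", "Marseille"]
with_ctx = ["Paris", "Paris", "Paris", "Paris"]
gain = contextual_entropy_reduction(with_ctx, no_ctx)
```

In this toy case the with-context entropy is zero, so the gain equals the no-context entropy; a near-zero gain would indicate the context added little information.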
Problem

Research questions and friction points this paper is trying to address.

Estimating confidence in large language model outputs
Incorporating context relevance for accurate confidence estimation
Improving trustworthiness in safety-critical LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware entropy reduction for uncertainty
Unified consistency examination for model uncertainty
Contrastive sampling with and without context
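The second ingredient, examining consistency across the with-context and without-context answer sets, could be sketched as follows. This is a deliberately simplified stand-in for the paper's unified consistency examination: it scores cross-set agreement by exact match over all answer pairs, whereas the actual metric presumably models global consistency more richly; the function name is hypothetical.

```python
def cross_context_consistency(answers_with_ctx, answers_without_ctx):
    # Fraction of (with-context, without-context) answer pairs that agree.
    # High agreement suggests the model's answer is stable regardless of
    # context, i.e., low model uncertainty.
    pairs = [(a, b) for a in answers_with_ctx for b in answers_without_ctx]
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy example: half of the cross-set pairs agree.
score = cross_context_consistency(["Paris", "Paris"], ["Paris", "Lyon"])
```

A confidence score in the spirit of CRUX would then combine this consistency signal with the entropy-reduction metric, though how the two are weighted is specific to the paper.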