Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

📅 2025-04-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address reliability issues stemming from hallucinations in large language models (LLMs) deployed in high-stakes domains such as healthcare and finance, this paper proposes a response-level, zero-resource uncertainty quantification (UQ) framework. Methodologically, it unifies three complementary sources of UQ signal: black-box metrics (e.g., output entropy and token consistency), white-box metrics (e.g., gradient and hidden-state variance), and LLM-as-a-judge prompt-based evaluation. It then introduces a customizable weighted ensemble mechanism that can be tuned to specific use cases, yielding standardized confidence scores in [0, 1]. Key contributions include (i) a response-level, zero-resource UQ paradigm for LLMs and (ii) UQLM, an open-source, plug-and-play UQ toolkit. Evaluated on multiple benchmark QA tasks, the ensemble achieves an average AUROC gain of 8.2% over the best single baseline, outperforming both individual metrics and state-of-the-art hallucination detection methods.
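The weighted ensemble described in the summary can be sketched as a normalized weighted average of per-scorer confidences. The scorer names, weights, and function signature below are illustrative assumptions for exposition, not the UQLM API:

```python
# Hypothetical sketch of a weighted ensemble of confidence scores.
# Scorer names and weights are illustrative, not taken from UQLM.

def ensemble_confidence(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-scorer confidence scores (each in [0, 1]) into a single
    ensemble score via a normalized weighted average, clipped to [0, 1]."""
    total = sum(weights[name] for name in scores)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    combined = sum(scores[name] * weights[name] for name in scores) / total
    return min(1.0, max(0.0, combined))

# Example: one black-box, one white-box, and one LLM-judge score,
# with weights a practitioner might tune for a given use case.
scores = {"consistency": 0.82, "likelihood": 0.67, "judge": 1.0}
weights = {"consistency": 0.5, "likelihood": 0.3, "judge": 0.2}
print(round(ensemble_confidence(scores, weights), 3))  # → 0.811
```

Because the weights are normalized inside the function, practitioners can tune them freely (e.g., via grid search against a labeled validation set) without keeping them summing to one.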

📝 Abstract
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
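As a concrete illustration of the black-box family the abstract mentions, one common zero-resource signal is consistency among resampled responses: if repeated generations agree with the original answer, confidence is high. The sketch below assumes a simple exact-match agreement rule after whitespace/case normalization; real scorers typically use softer similarity measures:

```python
# Hypothetical black-box consistency scorer: the fraction of resampled
# responses that agree with the original response after normalization.
# The normalization rule here (lowercase, collapsed whitespace) is an
# illustrative assumption, not the method used in the paper.

def consistency_score(original: str, samples: list[str]) -> float:
    """Return a confidence score in [0, 1] from response agreement."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    if not samples:
        return 0.0
    matches = sum(norm(s) == norm(original) for s in samples)
    return matches / len(samples)

# Three of four resampled answers match the original.
print(consistency_score("Paris", ["paris", "Paris ", "Lyon", "Paris"]))  # → 0.75
```

A score like this plugs directly into the tunable ensemble as one component alongside white-box and LLM-judge scores.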
Problem

Research questions and friction points this paper is trying to address.

Detect hallucinations in Large Language Models (LLMs) effectively
Propose a zero-resource framework for real-world hallucination detection
Enhance LLM reliability with tunable ensemble confidence scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts black-box, white-box, and LLM-as-a-Judge UQ techniques
Introduces a tunable ensemble for flexible confidence scoring
Provides the UQLM Python toolkit for streamlined implementation