An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inconsistent predictions and unreliable uncertainty quantification of large language models (LLMs) in high-stakes scenarios, this paper proposes a multi-model complementary uncertainty estimation method. Unlike conventional single-model calibration paradigms, we introduce an information-theoretic framework to explicitly model complementarity among LLMs, using Jensen–Shannon divergence to quantify inter-model discrepancy and dynamically select and ensemble the subset of models exhibiting optimal calibration performance. Evaluated on binary prediction tasks, our approach significantly outperforms both single-model baselines and naive ensembles: it improves predictive accuracy while substantially enhancing probabilistic calibration—reducing the Expected Calibration Error (ECE) by up to 37%. This demonstrates that model diversity is a critical factor for robust uncertainty quantification in LLMs.
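The select-and-ensemble loop described above can be sketched in a few lines. The paper's exact selection criterion is not spelled out in this summary, so the following is a minimal sketch under stated assumptions: it uses Jensen–Shannon divergence as the inter-model disagreement measure, scores candidate subsets by Expected Calibration Error (ECE) on held-out labels via brute-force search, and ensembles by averaging probabilities. The function names (`select_subset`, `expected_calibration_error`) are hypothetical, not from the paper.

```python
import itertools
import math

def js_divergence(p, q):
    # Jensen-Shannon divergence between two Bernoulli distributions,
    # given each model's positive-class probability for one example.
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    P, Q = (p, 1 - p), (q, 1 - q)
    M = tuple((x + y) / 2 for x, y in zip(P, Q))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def expected_calibration_error(probs, labels, n_bins=10):
    # Standard binned ECE: average |confidence - accuracy| per bin,
    # weighted by the fraction of examples falling in that bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / len(probs) * abs(conf - acc)
    return ece

def select_subset(model_probs, labels):
    # Brute-force illustration (assumption, not the paper's algorithm):
    # try every non-empty subset of models, ensemble by averaging their
    # probabilities, and keep the subset with the lowest held-out ECE.
    names = list(model_probs)
    best, best_ece = None, float("inf")
    for r in range(1, len(names) + 1):
        for subset in itertools.combinations(names, r):
            ens = [sum(model_probs[m][i] for m in subset) / len(subset)
                   for i in range(len(labels))]
            ece = expected_calibration_error(ens, labels)
            if ece < best_ece:
                best, best_ece = subset, ece
    return best, best_ece
```

Exhaustive search is exponential in the number of models; it is shown here only because typical multi-LLM pools are small. `js_divergence` gives the disagreement signal the summary attributes to MUSE and could additionally gate which subsets are even considered.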

📝 Abstract
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.
Problem

Research questions and friction points this paper is trying to address.

Quantify uncertainty in large language models for high-stakes applications
Leverage model diversity to improve uncertainty estimation accuracy
Propose MUSE method for better calibration via ensemble subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Jensen–Shannon Divergence to quantify disagreement between LLMs
Identifies and aggregates well-calibrated subsets of models
Improves both predictive performance and calibration (up to 37% lower ECE)