🤖 AI Summary
Existing uncertainty quantification (UQ) methods for large language models (LLMs) are often heuristic and lack a probabilistic foundation. Method: The paper proposes a fully probabilistic UQ framework built on an inverse model, which characterizes the semantic diversity of the input space conditioned on a given output. It models input-output pairs as a dual random walk, that is, two Markov chains whose transition probabilities are defined by semantic similarity, and introduces Inv-Entropy, a new uncertainty measure defined within this framework. The paper also proposes GAAP, a perturbation algorithm based on genetic algorithms that increases the diversity of sampled inputs, and Temperature Sensitivity of Uncertainty (TSU), an evaluation metric that assesses uncertainty directly rather than using correctness as a proxy. Contribution/Results: Extensive experiments show that Inv-Entropy outperforms existing semantic UQ methods. The code is publicly available, and the framework supports alternative uncertainty measures, embeddings, perturbation strategies, and similarity metrics.
📝 Abstract
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
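To make the dual random walk and inverse-model idea more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition of Inv-Entropy: the function names `transition_matrix` and `inverse_entropy`, the exponential weighting of cosine similarity, and the particular prior and likelihood are all choices made here for illustration. The sketch builds one row-stochastic transition matrix over perturbed inputs and one over their outputs, combines them with Bayes' rule into a distribution over inputs conditioned on a target output, and returns the Shannon entropy of that distribution.

```python
import numpy as np


def transition_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Row-stochastic matrix built from pairwise cosine similarity.

    Assumption: similarities are turned into positive weights with exp();
    the paper may use a different mapping.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T          # cosine similarity, values in [-1, 1]
    weights = np.exp(similarity)        # strictly positive weights
    return weights / weights.sum(axis=1, keepdims=True)


def inverse_entropy(input_emb: np.ndarray, output_emb: np.ndarray, target: int = 0) -> float:
    """Entropy of an assumed distribution over inputs given output `target`.

    Row i of both arrays refers to the same input-output pair; row `target`
    is treated as the original (unperturbed) pair.
    """
    p_in = transition_matrix(input_emb)    # chain over perturbed inputs
    p_out = transition_matrix(output_emb)  # chain over the matching outputs

    prior = p_in[target]                   # assumed p(x_i): closeness to the original input
    likelihood = p_out[:, target]          # assumed p(y_target | x_i)
    posterior = prior * likelihood         # Bayes' rule, unnormalized
    posterior = posterior / posterior.sum()

    return float(-(posterior * np.log(posterior)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(5, 8))       # stand-ins for input embeddings
    outputs = rng.normal(size=(5, 8))      # stand-ins for output embeddings
    print(inverse_entropy(inputs, outputs))
```

In this sketch, a high value means that many different inputs remain plausible for the target output, and a low value means that the output points back to a small set of inputs. The random vectors in the usage example are placeholders for real sentence embeddings.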