🤖 AI Summary
Quantifying factual uncertainty in black-box large language models (LLMs) remains challenging due to susceptibility to contextual bias, leading to inconsistent responses across paraphrased queries. Method: We propose a novel multi-agent collaborative uncertainty quantification paradigm: diverse perspectives are generated via paraphrased questioning, and response consistency is measured using entropy over agent outputs; this entropy then triggers an interpretable, active refusal mechanism. Contribution/Results: This work is the first to integrate multi-agent interaction and perspective diversity into LLM uncertainty modeling, uncovering the “knowing-but-unstable” phenomenon—where models possess factual knowledge yet exhibit unstable retrieval across viewpoints. Experiments demonstrate significant improvements over traditional self-consistency baselines in reliability prediction accuracy and hallucination detection. Empirical analysis further reveals pervasive cross-perspective inconsistency in factual retrieval among mainstream LLMs.
📝 Abstract
Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model's uncertainty by evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the original query with a wrong answer, yet respond correctly to varied questions posed from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction, under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about that query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods. Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query across diverse, varied questions, even when they know the correct answer.