🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect outputs—commonly termed “hallucinations.” Existing hallucination detection methods typically rely on access to internal model states (e.g., token probabilities) or external knowledge sources, rendering them inapplicable in black-box API settings. This paper proposes a purely black-box hallucination detection method that requires no internal model access or external resources. It leverages uncertainty-inducing prompts (e.g., “I’m not sure…”) and measures response consistency across multiple stochastic samplings to construct a factual consistency score. The resulting metric is fully API-based, highly generalizable, and easily deployable. Extensive experiments across diverse LLMs and benchmark datasets demonstrate that our method significantly outperforms internal-state– or knowledge-dependent baselines in hallucination identification accuracy. By enabling reliable factuality assessment without model introspection or auxiliary knowledge, this work establishes a novel paradigm for trustworthy LLM deployment in black-box environments.
📝 Abstract
Despite recent advances in language modeling, Large Language Models (LLMs) such as GPT-3 are notorious for generating non-factual responses, the so-called "hallucination" problem. Existing methods for detecting and alleviating hallucination require external resources or access to the internal state of the LLM, such as the output probability of each token. Given that many LLMs are accessible only through restricted APIs and that external resources cover a limited scope, there is an urgent need to establish black-box approaches as the cornerstone of effective hallucination detection. In this work, we propose a simple black-box hallucination detection metric, developed after investigating the behavior of LLMs under expressions of uncertainty. Our comprehensive analysis reveals that LLMs generate consistent responses when their answers are factual and inconsistent responses when they hallucinate. Based on this analysis, we propose an efficient black-box hallucination detection metric built on expressions of uncertainty. Experiments demonstrate that our metric is more predictive of the factuality of model responses than baselines that use internal knowledge of LLMs.
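The abstract describes the metric only at a high level: prepend an uncertainty-inducing phrase to the prompt, draw several stochastic samples, and score how consistent the responses are with one another. A minimal sketch of that idea is below; the `sample_fn` callable, the uncertainty prefix, and the token-overlap (Jaccard) consistency proxy are all illustrative assumptions, not the paper's actual prompt or scoring function.

```python
import itertools

def consistency_score(sample_fn, question, n_samples=5):
    """Black-box hallucination signal: sample several responses to an
    uncertainty-framed prompt and return their average pairwise agreement.
    `sample_fn` is a stand-in for a stochastic LLM API call (hypothetical);
    it takes a prompt string and returns one sampled response string."""
    # Uncertainty-inducing prefix (illustrative; the paper's exact prompt
    # wording is not given in the abstract).
    prompt = f"I'm not sure, but here is my answer. {question}"
    responses = [sample_fn(prompt) for _ in range(n_samples)]

    # Pairwise Jaccard overlap of token sets as a crude consistency proxy.
    def jaccard(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    pairs = list(itertools.combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low score flags likely hallucination: if the model keeps changing its answer under an uncertainty framing, the responses disagree and the average overlap drops, whereas factual answers tend to be restated consistently across samples.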