🤖 AI Summary
To address the weak cross-domain generalization of large language model (LLM) hallucination detection, this paper proposes PRISM: a framework that explicitly guides LLMs to activate structural representations correlated with textual veracity via prompt engineering, and aligns these representations across transformer layers. PRISM introduces the first prompt-driven internal activation modulation mechanism—requiring no cross-domain labeled data—and integrates structure-aware feature enhancement with prompt engineering grounded in hidden-layer activations. It is plug-and-play compatible with mainstream detectors (e.g., logit-difference scoring, hidden-state statistics). Evaluated on multi-domain benchmarks, PRISM achieves an average 12.3% F1-score improvement over baselines. Under zero-shot cross-domain transfer, it retains over 86% of in-domain performance, significantly enhancing detector robustness and generalizability.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of tasks in different domains. However, they sometimes generate responses that are logically coherent but factually incorrect or misleading, which is known as LLM hallucinations. Data-driven supervised methods train hallucination detectors by leveraging the internal states of LLMs, but detectors trained on specific domains often struggle to generalize well to other domains. In this paper, we aim to enhance the cross-domain performance of supervised detectors with only in-domain data. We propose a novel framework, prompt-guided internal states for hallucination detection of LLMs, namely PRISM. By utilizing appropriate prompts to guide changes to the structure related to text truthfulness in LLMs' internal states, we make this structure more salient and consistent across texts from different domains. We integrated our framework with existing hallucination detection methods and conducted experiments on datasets from different domains. The experimental results indicate that our framework significantly enhances the cross-domain generalization of existing hallucination detection methods.