How to Steer LLM Latents for Hallucination Detection?

📅 2025-03-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To detect hallucinations in large language model (LLM) outputs without fine-tuning model parameters, this paper proposes a latent-space steering method. Its core contribution is a lightweight Truthfulness Separator Vector (TSV): a steering vector added to the model's hidden representations at inference time to increase the separation between truthful and hallucinated generations. TSV is learned in two stages: Stage I trains it on a small set of human-labeled exemplars to form compact, well-separated clusters; Stage II augments this set with unlabeled LLM generations, assigning pseudo-labels via an optimal transport-based algorithm and retaining only high-confidence assignments through confidence-based filtering. Evaluated across multiple benchmarks, the approach achieves state-of-the-art hallucination detection with minimal labeled data, generalizes across models and datasets, and supports plug-and-play deployment without architectural modification or retraining of the target LLM.

πŸ“ Abstract
Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
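The abstract describes adding a steering vector to the LLM's representation space at inference time and then separating truthful from hallucinated outputs in that steered space. A minimal sketch of the idea, assuming a simple additive intervention and centroid-distance scoring (the paper's exact intervention layer, pooling, and scoring rule are not specified here):

```python
import numpy as np

def apply_tsv(hidden_states: np.ndarray, tsv: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Steer hidden states by adding a scaled Truthfulness Separator Vector.

    hidden_states: [seq_len, d] activations from some intermediate layer (hypothetical).
    tsv:           [d] learned steering direction.
    """
    return hidden_states + alpha * tsv

def truthfulness_score(h: np.ndarray, mu_true: np.ndarray, mu_hall: np.ndarray) -> float:
    """Score a steered, pooled representation h ([d]) by distance to class centroids.

    Positive score => closer to the truthful cluster than to the hallucinated one.
    """
    return float(np.linalg.norm(h - mu_hall) - np.linalg.norm(h - mu_true))
```

Here `mu_true` and `mu_hall` stand in for the cluster centroids formed during the first training stage; in practice the pooled representation of a generated answer would be steered and then scored against them.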
Problem

Research questions and friction points this paper is trying to address.

Detect hallucinations in LLMs using latent space
Separate truthful and hallucinated content effectively
Enhance LLM safety without altering model parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Truthfulness Separator Vector (TSV)
Uses optimal transport-based pseudo-labeling algorithm
Enhances LLM representation space without parameter changes
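The bullets above mention optimal transport-based pseudo-labeling combined with confidence filtering. A common way to realize balanced pseudo-label assignment is Sinkhorn normalization over a sample-to-cluster cost matrix; the sketch below takes that approach (the paper's exact algorithm, marginals, and hyperparameters `eps` and `tau` are assumptions here):

```python
import numpy as np

def ot_pseudo_labels(cost: np.ndarray, n_iters: int = 50,
                     eps: float = 0.05, tau: float = 0.9):
    """Assign pseudo-labels to N samples over K clusters via Sinkhorn iterations.

    cost: [N, K] distances from unlabeled embeddings to class centroids.
    Returns (labels, keep): argmax pseudo-labels and a confidence-filter mask.
    """
    q = np.exp(-cost / eps)                    # soft affinities
    for _ in range(n_iters):
        q = q / q.sum(axis=0, keepdims=True)   # equalize cluster mass (balanced OT)
        q = q / q.sum(axis=1, keepdims=True)   # each row becomes a distribution
    labels = q.argmax(axis=1)
    keep = q.max(axis=1) >= tau                # confidence-based filtering
    return labels, keep
```

Only samples passing the confidence threshold would be added to the exemplar set for the second training stage; ambiguous generations are simply discarded.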
Seongheon Park
University of Wisconsin-Madison
Machine Learning, Reliable AI
Xuefeng Du
Department of Computer Sciences, University of Wisconsin-Madison, USA
Min-Hsuan Yeh
University of Wisconsin-Madison
Natural Language Processing
Haobo Wang
Zhejiang University
Machine Learning
Yixuan Li
Department of Computer Sciences, University of Wisconsin-Madison, USA