🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect content, necessitating efficient and interpretable factual consistency detection methods. Existing probe-based approaches require supervised training and suffer from poor generalization; unsupervised methods like NoVo rely solely on attention mechanisms, overlooking critical factual recall signals known to reside in MLP modules.
Method: We discover, for the first time, that statistical properties—such as variance and entropy—of MLP value vectors exhibit strong correlation with output factual consistency. Leveraging this insight, we propose a zero-shot, training-free detector grounded exclusively in MLP value-vector statistics.
Contribution/Results: Our method is highly interpretable and computationally lightweight. On the NoVo benchmark, it significantly outperforms both NoVo and log-likelihood baselines. This demonstrates that MLP layers encode rich, robust factual signals—challenging the prevailing overreliance on attention mechanisms for truthfulness assessment.
📝 Abstract
Large language models often generate factually incorrect outputs, motivating efforts to detect the truthfulness of their content. Most existing approaches rely on training probes over internal activations, but these methods suffer from scalability and generalization issues. A recent training-free method, NoVo, addresses this challenge by exploiting statistical patterns from the model itself. However, it focuses exclusively on attention mechanisms, potentially overlooking the MLP module, a core component of Transformer models known to support factual recall. In this paper, we show that certain value vectors within MLP modules exhibit truthfulness-related statistical patterns. Building on this insight, we propose TruthV, a simple and interpretable training-free method that detects content truthfulness by leveraging these value vectors. On the NoVo benchmark, TruthV significantly outperforms both NoVo and log-likelihood baselines, demonstrating that MLP modules, despite being neglected in prior training-free efforts, encode rich and useful signals for truthfulness detection. These findings offer new insights into how truthfulness is internally represented in LLMs and motivate further research on scalable and interpretable truthfulness detection.
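To make the core idea concrete, here is a minimal sketch of how variance and entropy statistics might be computed over MLP value-vector activation coefficients and combined into a truthfulness score. This is an illustration under stated assumptions, not the paper's actual method: the function names, the entropy definition (over normalized absolute coefficients), and the sign convention of the score are all hypothetical choices made for this example.

```python
import numpy as np

def value_vector_stats(activations):
    """Compute simple statistics of MLP value-vector activation coefficients.

    `activations` has shape (num_value_vectors,): the coefficients with which
    each MLP value vector contributes to the residual stream at one token
    position (a hypothetical representation for this sketch).
    """
    variance = float(np.var(activations))
    # Entropy of the normalized absolute coefficients; a uniform spread of
    # activation mass gives high entropy, a few dominant vectors give low
    # entropy. The small constant guards against log(0).
    p = np.abs(activations) / np.sum(np.abs(activations))
    entropy = float(-np.sum(p * np.log(p + 1e-12)))
    return {"variance": variance, "entropy": entropy}

def truthfulness_score(stats, weight=1.0):
    """Illustrative zero-shot score: higher variance and lower entropy
    (a 'peaked' recall pattern concentrated on a few value vectors) is
    treated as more consistent with factual recall. The direction of this
    correlation is an assumption for demonstration purposes only.
    """
    return stats["variance"] - weight * stats["entropy"]
```

Because the score is a pure function of statistics already produced during the forward pass, a detector built this way needs no training data or probe fitting, which is what makes the training-free setting attractive.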