🤖 AI Summary
This study investigates the consistency, detectability, and cross-task generalizability of linear "truth directions": low-dimensional subspaces in large language models (LLMs) that encode factual veracity. It addresses three open questions — (i) whether such directions are consistent across models, (ii) whether their detection requires complex probing methods, and (iii) whether they generalize to logical reasoning, question answering, in-context learning, and external-knowledge settings — using lightweight linear probes trained on atomic factual statements. The results show that truth directions are highly stable across strong LLMs, enabling >85% binary truth classification accuracy without fine-tuning, and that they generalize from simple declarative statements to multi-step reasoning and open-ended QA. Building on this finding, the authors propose a selective (trust-aware) question-answering mechanism that modulates outputs according to truth-direction alignment, improving output reliability and user trust. This work establishes truth directions as robust, interpretable, and practically deployable geometric signals of factual consistency in LLM representations.
📝 Abstract
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Prior work suggests that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify them; and (iii) how well the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions; stronger representations are observed in more capable models, particularly under logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is publicly available at https://github.com/colored-dye/truthfulness_probe_generalization
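To illustrate the "lightweight linear probe" idea mentioned above, here is a minimal, self-contained sketch of a difference-of-means probe on synthetic activations. This is not the paper's actual code: the hidden states, the planted `truth_dir`, and all dimensions are toy stand-ins for activations one would extract from a real LLM; the probe itself (mean of true-statement activations minus mean of false-statement activations, threshold on the projection) is one simple, commonly used linear-probing scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 600  # toy hidden-state size and sample count (real LLM states are far larger)

# Plant a hypothetical "truth direction": true statements shift the
# hidden state one way along it, false statements the other way.
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

labels = rng.integers(0, 2, size=n)   # 1 = true statement, 0 = false statement
signs = 2 * labels - 1                # map {0, 1} -> {-1, +1}
X = signs[:, None] * truth_dir + 0.5 * rng.normal(size=(n, d))  # signal + noise

X_train, y_train = X[:500], labels[:500]
X_test, y_test = X[500:], labels[500:]

# Difference-of-means probe: direction from the false-class mean to the
# true-class mean, with a threshold at the mean projection.
w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
w /= np.linalg.norm(w)
bias = (X_train @ w).mean()

preds = ((X_test @ w) > bias).astype(int)
acc = (preds == y_test).mean()
print(f"held-out accuracy: {acc:.2f}")
print(f"cosine with planted direction: {abs(w @ truth_dir):.2f}")
```

On this synthetic data the probe recovers the planted direction almost exactly and separates true from false statements nearly perfectly; the paper's finding is that an analogously simple probe, trained on activations for atomic factual statements, already transfers to much harder settings.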