🤖 AI Summary
This study investigates whether a universal "truth direction" exists in large language models. By training truth probes on activations at many layers, the authors systematically evaluate how stable the direction is and how well it generalizes across task types (factual judgment versus reasoning), levels of task complexity, model depths, and prompt templates. Their findings indicate that the truth direction is not universally stable: it emerges in earlier layers for factual tasks and in later layers for reasoning tasks, and probe performance varies with task complexity. Moreover, even simple correctness-evaluation instructions in the prompt significantly affect how well truth probes generalize. These results challenge earlier claims that the truth direction is a robust, task-agnostic representation, and instead show that it is sensitive to layer depth, task type and difficulty, and prompt formulation.
📝 Abstract
Large language models (LLMs) have been shown to encode the truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain respects, while more recent work has questioned this conclusion, drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual tasks and in later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable across model layers, task difficulties, task types, and prompt templates.
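To make the layer-wise probing setup concrete, the following is a minimal sketch of how a linear truth probe might be fitted at each layer on one task and scored on another. It is not the authors' implementation: the abstract does not specify the probe type, so a difference-of-means ("mass-mean") probe is assumed here, and synthetic Gaussian activations stand in for real LLM hidden states. All function names (`fit_truth_direction`, `probe_accuracy`, `layerwise_transfer`, `make_synthetic`) are hypothetical.

```python
import numpy as np


def fit_truth_direction(acts, labels):
    """Difference-of-means probe (an assumed probe type, not from the paper).

    Returns a unit vector pointing from the false-class mean to the
    true-class mean, plus a decision threshold along that vector.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    direction = mu_true - mu_false
    direction = direction / np.linalg.norm(direction)
    # Threshold at the midpoint of the two class means, projected on the direction.
    threshold = 0.5 * (mu_true + mu_false) @ direction
    return direction, threshold


def probe_accuracy(direction, threshold, acts, labels):
    """Classify by projecting onto the direction and comparing to the threshold."""
    preds = np.asarray(acts, dtype=float) @ direction > threshold
    return float((preds == np.asarray(labels, dtype=bool)).mean())


def layerwise_transfer(train_acts, train_labels, test_acts, test_labels):
    """Fit one probe per layer on the training task, score it on the test task."""
    scores = {}
    for layer in train_acts:
        direction, threshold = fit_truth_direction(train_acts[layer], train_labels)
        scores[layer] = probe_accuracy(direction, threshold, test_acts[layer], test_labels)
    return scores


def make_synthetic(rng, direction, labels, sep=3.0):
    """Synthetic stand-in for hidden states: unit Gaussian noise plus a
    class-dependent shift of +/- sep along the given direction."""
    signs = np.where(labels, 1.0, -1.0)[:, None]
    return rng.normal(size=(labels.size, direction.size)) + sep * signs * direction


rng = np.random.default_rng(0)
dim, n = 16, 400
labels = np.arange(n) % 2 == 0          # balanced true/false statements
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]  # two orthogonal candidate directions

# Layer 0: both tasks share a truth direction. Layer 1: the directions differ.
train_acts = {0: make_synthetic(rng, e0, labels), 1: make_synthetic(rng, e0, labels)}
test_acts = {0: make_synthetic(rng, e0, labels), 1: make_synthetic(rng, e1, labels)}

accs = layerwise_transfer(train_acts, labels, test_acts, labels)
for layer, acc in accs.items():
    print(f"layer {layer}: cross-task accuracy = {acc:.3f}")
```

In this toy setup the probe should transfer well at the layer where both tasks share a direction and fall to roughly chance where they do not, which is the kind of layer-dependent generalization gap the abstract describes. With real models, the per-layer activations would come from the model's hidden states on each dataset, and a logistic-regression probe could be swapped in for the mean-difference one.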