Emergence of Linear Truth Encodings in Language Models

📅 2025-10-17

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

While factual and counterfactual statements in large language models (LLMs) exhibit linearly separable subspaces in representation space, the underlying mechanism driving this phenomenon remains unclear. Method: We construct an interpretable single-layer Transformer “toy” model to end-to-end reproduce the emergence of truth encoding; complemented by probe-based analyses and experiments on pretrained language models, we systematically characterize the evolution of linear structure in the representation space. Contribution: We identify co-occurrence patterns of factual statements as the key driver of truth–falsity separation. We uncover a two-stage learning dynamic: first memorizing specific facts, then abstracting linearly separable truth representations. Crucially, we demonstrate that standard autoregressive language modeling objectives alone suffice for models to spontaneously acquire linear truth encodings—yielding significant reductions in next-token prediction loss. This work provides the first mechanistic account of how linear truth structure emerges from unsupervised pretraining.

Technology Category

Application Category

📝 Abstract

Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.

Problem

Research questions and friction points this paper is trying to address.

Mechanism behind linear truth subspaces emergence in language models

How factual co-occurrence patterns enable truth-false separation learning

Two-phase learning dynamics for truth encoding in transparent transformer model

Innovation

Methods, ideas, or system contributions that make the work stand out.

One-layer transformer model reproduces truth subspaces

Data distribution encourages learning true-false distinction

Two-phase learning dynamic separates truth linearly

🔎 Similar Papers

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations