🤖 AI Summary
While factual and counterfactual statements in large language models (LLMs) exhibit linearly separable subspaces in representation space, the underlying mechanism driving this phenomenon remains unclear.
Method: We construct an interpretable single-layer Transformer “toy” model to end-to-end reproduce the emergence of truth encoding; complemented by probe-based analyses and experiments on pretrained language models, we systematically characterize the evolution of linear structure in the representation space.
Contribution: We identify co-occurrence patterns of factual statements as the key driver of truth–falsity separation. We uncover a two-stage learning dynamic: first memorizing specific facts, then abstracting linearly separable truth representations. Crucially, we demonstrate that standard autoregressive language modeling objectives alone suffice for models to spontaneously acquire linear truth encodings—yielding significant reductions in next-token prediction loss. This work provides the first mechanistic account of how linear truth structure emerges from unsupervised pretraining.
📝 Abstract
Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.