Transformers Linearly Represent Highly Structured World Models

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study asks whether Transformers build internal world models whose structure matches the constraints of their task domain, using Sudoku as a case study. An 8-layer Transformer is trained on Sudoku solving traces, and its internal representations and final-layer MLP neurons are analyzed with mechanistic interpretability methods. The analysis finds that the model does not organize information cell by cell; it organizes it around the units that Sudoku's constraints act on (rows, columns, and boxes). The model also develops a sparse, monosemantic, and fully interpretable “naked single” decision circuit, in which individual neurons detect that exactly one digit remains possible for a specific cell and promote that digit. From these results the authors recover an end-to-end account of the model's solving algorithm and conclude that the geometry of its world model is shaped by the algebraic structure of the task's constraints rather than by the surface presentation of the board.
📝 Abstract
Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.
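For readers unfamiliar with the Sudoku terms used in the abstract, the following is a minimal classical sketch of the two ideas it names: board state tracked per row, column, and box instead of per cell, and the naked-single rule, which fires when exactly one digit remains possible for a cell. It is a reference implementation of the puzzle logic only, not the model's learned circuit or the authors' code. All names here (`substructure_state`, `candidates`, `naked_singles`) and the board encoding (a 9x9 list of lists with 0 for empty cells) are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Board = List[List[int]]  # assumed encoding: 9x9 grid, 0 marks an empty cell
DIGITS = frozenset(range(1, 10))


def box_index(r: int, c: int) -> int:
    """Index (0..8) of the 3x3 box containing cell (r, c)."""
    return (r // 3) * 3 + c // 3


def substructure_state(board: Board) -> Tuple[List[Set[int]], List[Set[int]], List[Set[int]]]:
    """Record which digits are already placed in each row, column, and box."""
    rows = [set() for _ in range(9)]
    cols = [set() for _ in range(9)]
    boxes = [set() for _ in range(9)]
    for r in range(9):
        for c in range(9):
            d = board[r][c]
            if d:
                rows[r].add(d)
                cols[c].add(d)
                boxes[box_index(r, c)].add(d)
    return rows, cols, boxes


def candidates(state, r: int, c: int) -> Set[int]:
    """Digits not yet used by the row, column, or box of cell (r, c)."""
    rows, cols, boxes = state
    return set(DIGITS) - rows[r] - cols[c] - boxes[box_index(r, c)]


def naked_singles(board: Board) -> List[Tuple[int, int, int]]:
    """Return (row, col, digit) for every empty cell with exactly one candidate."""
    state = substructure_state(board)
    found = []
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                cand = candidates(state, r, c)
                if len(cand) == 1:
                    found.append((r, c, next(iter(cand))))
    return found
```

In this sketch a cell's candidates are never stored directly. They are derived from the three substructure sets the cell belongs to, which is one plain way to read the abstract's claim that the constraint units, not the cells, carry the state.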
Problem

Research questions and friction points this paper is trying to address.

transformer
world model
mechanistic interpretability
structured representation
combinatorial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
mechanistic interpretability
structured representation
sparse circuit
combinatorial reasoning