🤖 AI Summary
Problem: The “world model” concept in neural networks lacks an operational definition, hindering rigorous evaluation and comparison.
Method: We propose a formal, testable criterion grounded in the linear probing literature, defining a world model as a causal representation of the environment’s latent state space, augmented by a nontriviality condition that excludes representations arising trivially from the data or task structure.
Contribution/Results: This work provides a verifiable, reproducible operational definition of world models; explicitly separates representational content (latent states) from computational mechanism (the generation process); and establishes a common language and testable criteria for empirical investigation. By enforcing causal fidelity and nontriviality, the framework makes claims about a network’s internal model of its environment more precise and more testable, supporting principled diagnosis of whether learned representations genuinely capture the environment’s latent state rather than spurious correlations.
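To make the criterion concrete, here is a minimal toy sketch of the probing-plus-control pattern the summary describes. This is our own illustration, not code from the paper: the synthetic latent states, the ridge-regression probe, and all names (`h_model`, `probe_r2`, etc.) are assumptions made for exposition.

```python
# Hypothetical sketch of a linear-probe test with a nontriviality control.
# (1) A linear probe should recover the environment's latent state from the
#     trained network's activations.
# (2) The same probe on activations unrelated to the task (the control)
#     should fail, ruling out trivially decodable "world models".
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, state_dim, hidden_dim = 2000, 500, 4, 64

# Toy latent states of the data-generating process.
z = rng.normal(size=(n_train + n_test, state_dim))

# Stand-in for a trained network: activations that noisily encode z.
W_model = rng.normal(size=(state_dim, hidden_dim))
h_model = np.tanh(0.3 * z @ W_model) + 0.1 * rng.normal(size=(n_train + n_test, hidden_dim))

# Nontriviality control: activations carrying no information about z.
h_control = rng.normal(size=(n_train + n_test, hidden_dim))

def probe_r2(h, z, n_train, ridge=1e-3):
    """Fit a ridge-regression linear probe h -> z; return held-out R^2."""
    h_tr, z_tr, h_te, z_te = h[:n_train], z[:n_train], h[n_train:], z[n_train:]
    A = h_tr.T @ h_tr + ridge * np.eye(h.shape[1])
    B = np.linalg.solve(A, h_tr.T @ z_tr)  # closed-form probe weights
    resid = ((z_te - h_te @ B) ** 2).sum()
    total = ((z_te - z_tr.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

print("probe R^2, trained model:", probe_r2(h_model, z, n_train))    # high
print("probe R^2, control      :", probe_r2(h_control, z, n_train))  # ~0
```

The intended pattern of results, high held-out R² for the trained model’s activations and near-zero for the control, is what the nontriviality condition is meant to certify.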
📝 Abstract
We propose a set of precise criteria for saying that a neural net learns and uses a "world model." The goal is to give an operational meaning to terms that are often used informally, so as to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent "state space" of the world, leaving the modeling of action effects to future work. Our definition is based on ideas from the linear probing literature and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions for checking that such a "world model" is not a trivial consequence of the neural net's data or task.
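One plausible way to write the "factors through" condition in symbols (the notation here is ours, not necessarily the paper's): let s(x) be the latent state that the data-generating process assigns to input x, and let f be the function the network computes.

```latex
% Candidate formalization (our notation, not the paper's): the network's
% computation f factors through a representation \phi of the generator's
% latent state s, up to an invertible recoding \iota.
\[
  f(x) \;=\; g\bigl(\phi(x)\bigr),
  \qquad
  \phi(x) \;=\; \iota\bigl(s(x)\bigr)
  \quad \text{for some invertible } \iota .
\]
```

Under this reading, the linear-probe requirement makes \phi empirically accessible from the network's activations, and the nontriviality conditions rule out cases where s is decodable for uninteresting reasons (e.g., directly from the raw input, or from a randomly initialized network).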