Language Models use Lookbacks to Track Beliefs

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit theory-of-mind (ToM) capabilities by examining how they represent agents' false beliefs, i.e., beliefs inconsistent with reality. Method: Analyzing Llama-3-70B-Instruct on a controllable dataset of dual-agent, dual-object causal narratives, the authors identify and formally characterize a novel "lookback" mechanism comprising three operations: binding, answer retrieval, and visibility assessment. Using causal mediation analysis and low-rank subspace decomposition of residual streams, they localize and intervene upon the critical attention and MLP modules, showing that reference information is encoded as Ordering IDs. Contribution/Results: The model dynamically updates structured belief triples (character-object-state) within a low-rank residual subspace. This provides mechanistic evidence of compositional, interpretable ToM reasoning in LLMs and a template for explanation-driven analysis of belief tracking.

📝 Abstract
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into the LM's belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
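The causal mediation analysis mentioned in the abstract follows the standard interchange-intervention recipe: run the model on an original and a counterfactual story, splice the counterfactual activation into the original run at a candidate site, and measure how the output shifts. A minimal sketch on a toy two-layer linear network (not the paper's Llama-3-70B setup; the weights and site indices are purely illustrative):

```python
import numpy as np

W1 = np.array([[1.0, -1.0], [0.5, 2.0]])  # layer 1 weights
W2 = np.array([[2.0, 0.0], [0.0, 1.0]])   # readout / unembedding

def run(x, patch=None):
    """Forward pass; `patch` = (site, value) overwrites one hidden unit."""
    h = W1 @ x                    # hidden "residual" activations
    if patch is not None:
        site, value = patch
        h = h.copy()
        h[site] = value           # interchange intervention at one site
    return h, W2 @ h              # activations, logits

x_orig = np.array([1.0, 0.0])     # stands in for the original story
x_cf   = np.array([0.0, 1.0])     # stands in for the counterfactual story

h_cf, _ = run(x_cf)               # cache counterfactual activations
_, logits_orig    = run(x_orig)
_, logits_patched = run(x_orig, patch=(0, h_cf[0]))

# Causal (mediated) effect of site 0 = logit change under the patch.
effect = logits_patched - logits_orig   # -> [-4.  0.]: only logit 0 is mediated
```

In the paper's setting the patched site would be a low-rank subspace of a token's residual stream rather than a single scalar unit, but the logic of the intervention is the same.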
Problem

Research questions and friction points this paper is trying to address.

How do LMs represent characters' beliefs when those beliefs differ from reality?
Can causal mediation and abstraction localize the LM's belief-tracking computation?
What mechanism lets the LM update a character's beliefs when visibility between characters changes?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies a lookback mechanism underlying belief tracking
Binds character-object-state triples via Ordering IDs in low-rank residual subspaces
Generates visibility IDs encoding observer-observed character relations
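The two-step retrieval the abstract describes (binding lookback fetches the state's Ordering ID, answer lookback fetches the state token) can be caricatured symbolically. This is a hand-written analogy to the mechanism, not the model's computation; all entity names and the dictionary layout are illustrative:

```python
# Each entity is tagged with an Ordering ID (OI) in order of first mention.
characters = ["Alice", "Bob"]     # character OIs 0, 1
objects    = ["cup", "box"]       # object OIs 0, 1
states     = ["full", "empty"]    # state tokens, indexed by state OI

# Binding: co-locate (character OI, object OI) -> state OI, as if stored
# in the state token's residual stream.
bindings = {(0, 0): 0,            # Alice put the cup in the "full" state
            (1, 1): 1}            # Bob put the box in the "empty" state

def belief(char, obj):
    """Binding lookback retrieves the state OI; answer lookback maps it
    back to the state token."""
    state_oi = bindings[(characters.index(char), objects.index(obj))]
    return states[state_oi]
```

Queried for `belief("Alice", "cup")`, the sketch first resolves the pair of OIs to a state OI (binding lookback), then dereferences that OI to the token `"full"` (answer lookback). The visibility lookback would, by analogy, copy bindings between character OIs when one character observes another.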