🤖 AI Summary
This work investigates the differential cognitive capabilities of large language models (LLMs) in Theory of Mind (ToM) versus world modeling (WM). We propose StorySim, a novel framework that employs procedurally generated, controllable synthetic narratives and symbolic storyboard representations to construct multi-level reasoning tasks, including first- and second-order ToM as well as WM, while enabling clean, controlled evaluation through ablation of confounding variables. Crucially, StorySim avoids pretraining-data contamination and supports fine-grained attribution analysis. Experimental results reveal that state-of-the-art LLMs perform significantly worse on ToM tasks than on WM tasks; that they reason more accurately about human agents than about inanimate objects; and that they display systematic heuristic biases, including recency bias and an over-reliance on earlier events. These findings point to structural limitations in current LLMs' mental-state reasoning.
📝 Abstract
We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than on ToM tasks, and that models tend to reason better about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
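To make the setup concrete, here is a minimal toy sketch (not the authors' code; all names and event types are illustrative assumptions) of the storyboard idea: a symbolic event list is procedurally composed, rendered into a natural-language story, and paired with a first-order ToM question whose ground-truth answer (the character's stale belief) deliberately diverges from the WM answer (the object's true location).

```python
import random

# Illustrative vocab -- the real StorySim grammar is richer.
CHARACTERS = ["Alice", "Bob"]
ROOMS = ["kitchen", "garden"]
OBJECTS = ["apple", "key"]

def make_storyboard(rng):
    """Compose symbolic events: a witness sees an object, leaves,
    and the object is then moved without the witness's knowledge."""
    obj = rng.choice(OBJECTS)
    start, end = rng.sample(ROOMS, 2)        # two distinct rooms
    witness, mover = rng.sample(CHARACTERS, 2)
    events = [
        ("place", obj, start),
        ("observe", witness, obj, start),
        ("leave", witness),
        ("move", mover, obj, end),
    ]
    return events, witness, obj, start, end

def render(events):
    """Turn symbolic events into story sentences."""
    lines = []
    for ev in events:
        if ev[0] == "place":
            lines.append(f"The {ev[1]} is in the {ev[2]}.")
        elif ev[0] == "observe":
            lines.append(f"{ev[1]} sees the {ev[2]} in the {ev[3]}.")
        elif ev[0] == "leave":
            lines.append(f"{ev[1]} leaves.")
        elif ev[0] == "move":
            lines.append(f"{ev[1]} moves the {ev[2]} to the {ev[3]}.")
    return " ".join(lines)

def first_order_tom_task(seed=0):
    rng = random.Random(seed)
    events, witness, obj, start, end = make_storyboard(rng)
    story = render(events)
    question = f"Where does {witness} think the {obj} is?"
    # The witness left before the move, so their belief is stale:
    # `start` is the ToM answer; `end` is the WM (true-state) answer.
    return story, question, start, end

story, question, tom_answer, wm_answer = first_order_tom_task()
```

Because the storyboard is symbolic, the same event list can back either question type: asking for the object's actual location tests WM, while asking for the witness's belief tests ToM, with everything else held fixed.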