🤖 AI Summary
This work investigates the differential cognitive capabilities of large language models (LLMs) in Theory of Mind (ToM) versus world modeling (WM). We propose StorySim, a novel framework that employs procedurally generated, controllable synthetic narratives and symbolic storyboard representations to construct multi-level reasoning tasks, including first- and second-order ToM as well as WM, while enabling clean, controlled evaluation through ablation of confounding variables. Crucially, StorySim avoids pretraining-data contamination and supports fine-grained attribution analysis. Experimental results reveal that state-of-the-art LLMs perform significantly worse on ToM tasks than on WM tasks; that they reason more accurately about human agents than about inanimate objects; and that they display systematic heuristic biases, including recency bias and an over-reliance on earlier events. These findings point to structural limitations in current LLMs' mental-state reasoning.
📝 Abstract
We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than on ToM tasks, and that models tend to reason better about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
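To make the setup concrete, here is a minimal toy sketch (not the authors' code; all names and event types are illustrative assumptions) of the storyboard idea: a symbolic event list is procedurally composed, rendered into a natural-language story, and paired with a first-order ToM question whose ground-truth answer (the character's stale belief) deliberately diverges from the WM answer (the object's true location).

```python
import random

# Illustrative vocab -- the real StorySim grammar is richer.
CHARACTERS = ["Alice", "Bob"]
ROOMS = ["kitchen", "garden"]
OBJECTS = ["apple", "key"]

def make_storyboard(rng):
    """Compose symbolic events: a witness sees an object, leaves,
    and the object is then moved without the witness's knowledge."""
    obj = rng.choice(OBJECTS)
    start, end = rng.sample(ROOMS, 2)        # two distinct rooms
    witness, mover = rng.sample(CHARACTERS, 2)
    events = [
        ("place", obj, start),
        ("observe", witness, obj, start),
        ("leave", witness),
        ("move", mover, obj, end),
    ]
    return events, witness, obj, start, end

def render(events):
    """Turn symbolic events into story sentences."""
    lines = []
    for ev in events:
        if ev[0] == "place":
            lines.append(f"The {ev[1]} is in the {ev[2]}.")
        elif ev[0] == "observe":
            lines.append(f"{ev[1]} sees the {ev[2]} in the {ev[3]}.")
        elif ev[0] == "leave":
            lines.append(f"{ev[1]} leaves.")
        elif ev[0] == "move":
            lines.append(f"{ev[1]} moves the {ev[2]} to the {ev[3]}.")
    return " ".join(lines)

def first_order_tom_task(seed=0):
    rng = random.Random(seed)
    events, witness, obj, start, end = make_storyboard(rng)
    story = render(events)
    question = f"Where does {witness} think the {obj} is?"
    # The witness left before the move, so their belief is stale:
    # `start` is the ToM answer; `end` is the WM (true-state) answer.
    return story, question, start, end

story, question, tom_answer, wm_answer = first_order_tom_task()
```

Because the storyboard is symbolic, the same event list can back either question type: asking for the object's actual location tests WM, while asking for the witness's belief tests ToM, with everything else held fixed.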