🤖 AI Summary
Theory-of-mind (ToM) evaluations of large language models (LLMs) currently rely on end-point question answering, which does not reveal whether a model actually constructs the internal mental-state representations that underpin social reasoning, particularly when beliefs are divergent, evolving, or false. To address this limitation, this work introduces OmniToM, a benchmark built on explicit belief modeling: models must represent every relevant character's knowledge, intentions, emotions, and false beliefs in a unified format of minimal belief propositions. Built from 895 ToMBench stories and 22,343 belief propositions labeled through a human-calibrated, LLM-assisted annotation pipeline, OmniToM uses a seven-dimensional labeling scheme covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Zero-shot evaluations show that existing models have a character-specific belief-tracking bottleneck, especially in converting narrative facts into character-relative beliefs and in modeling shared mental states.
📝 Abstract
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. To address this gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or about another actor's mental state, which allows knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages. In Stage 1 (Belief Extraction), the model extracts from the story the beliefs relevant to its social dynamics. In Stage 2 (Belief Labeling), it assigns each belief a label under a seven-dimensional schema covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. OmniToM is built from 895 stories drawn from the existing ToMBench story corpus, augmented with 22,343 labeled belief propositions annotated through a human-calibrated, LLM-assisted pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.
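As a rough illustration of the belief-proposition format described above, the following Python sketch shows one way a labeled proposition could be represented and compared against a prediction. Only the seven dimension names are taken from the abstract. The class names, field types, example values (for instance `"no_access"` or `"object_transfer"`), the example story, and the `per_dimension_accuracy` helper are all assumptions made for illustration; they are not OmniToM's actual label vocabulary or evaluation metric.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, List


@dataclass(frozen=True)
class BeliefLabel:
    """Seven-dimensional label; dimension names follow the abstract,
    value types and vocabularies are hypothetical."""
    recursive_order: int    # assumed: 1 = "A believes X", 2 = "A believes B believes X"
    truth_status: str       # assumed values such as "true" / "false"
    knowledge_access: str   # assumed: whether the actor could have observed the fact
    explicitness: str       # assumed: stated in the story vs. inferred
    content_type: str
    mental_source: str
    context: str


@dataclass(frozen=True)
class BeliefProposition:
    """A minimal statement of what one actor takes to be true."""
    actor: str
    statement: str
    label: BeliefLabel


def per_dimension_accuracy(
    gold: List[BeliefLabel], pred: List[BeliefLabel]
) -> Dict[str, float]:
    """Fraction of exact matches per label dimension (hypothetical metric)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return {
        f.name: sum(getattr(g, f.name) == getattr(p, f.name)
                    for g, p in zip(gold, pred)) / len(gold)
        for f in fields(BeliefLabel)
    }


# Illustrative false-belief example (not taken from the benchmark).
gold = BeliefProposition(
    actor="Sally",
    statement="The marble is in the basket.",
    label=BeliefLabel(1, "false", "no_access", "implicit",
                      "location", "perception", "object_transfer"),
)

# A hypothetical model prediction that gets only the truth status wrong.
pred_label = replace(gold.label, truth_status="true")

scores = per_dimension_accuracy([gold.label], [pred_label])
print(scores)
```

Keeping every mental-state type in one record shape mirrors the abstract's point that knowledge, intentions, emotions, and false beliefs can be analyzed in a common format; scoring each dimension separately is one plausible way to localize errors such as a knowledge-access mistake.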