🤖 AI Summary
Existing Theory of Mind (ToM) benchmarks are largely confined to static textual scenarios, failing to capture the dynamic, embodied, and multi-agent nature of real-world social interactions.
Method: We introduce the first embodied, multi-agent ToM benchmark for complex social reasoning, integrating first-person, real-time multimodal perception with third-person global observation to form a hierarchical framework for mental-state inference. Leveraging the SoMi simulation environment, we construct a challenging multimodal dataset comprising 35 third-person videos, 363 first-person images, and 1,225 expert-annotated three-option multiple-choice questions.
Contribution/Results: Experiments reveal that state-of-the-art large vision-language models (LVLMs) trail humans by an average accuracy gap of 40.1% on first-person ToM tasks and 26.4% on third-person ToM tasks, highlighting critical deficiencies in modeling dynamic social cognition. This work establishes a novel, scalable paradigm for evaluating embodied ToM, providing both a methodological framework and a publicly accessible benchmark.
📝 Abstract
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings during dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks evaluate only static, text-based scenarios, leaving a significant gap relative to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied, multi-agent, complex social interactions. The benchmark is built on rich multimodal interaction data generated by the SoMi interaction environment, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation supplies multimodal input (visual, dialogue, action, etc.) from a first-person perspective during a task for real-time state inference, and (2) third-person evaluation supplies complete third-person video and text records after a task for goal and behavior inference. This design allows a more comprehensive examination of a model's ToM capabilities from both subjective immediate experience and objective global observation. We constructed a challenging dataset containing 35 third-person-perspective videos, 363 first-person-perspective images, and 1,225 expert-annotated multiple-choice questions (three options each). On this dataset, we systematically evaluated human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
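To make the evaluation setup more concrete, below is a minimal sketch of how a three-option multiple-choice benchmark split by perspective could be scored. The JSONL layout, field names (`perspective`, `answer`, `choices`), and function names are illustrative assumptions for this sketch, not the released SoMi-ToM format or tooling.

```python
# Minimal sketch of scoring a three-option multiple-choice ToM benchmark,
# reporting accuracy separately for first-person and third-person items.
# The item schema below is an assumption for illustration only.
import json
from collections import defaultdict

def load_items(path: str):
    """Load benchmark items from a JSONL file (one question per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def accuracy_by_perspective(items, predictions):
    """Compute accuracy per perspective.

    `predictions` maps item id -> chosen option letter ("A", "B", or "C").
    Each item is assumed to carry "id", "perspective", and "answer" fields.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        view = item["perspective"]  # e.g. "first_person" or "third_person"
        total[view] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[view] += 1
    return {view: correct[view] / total[view] for view in total}

def human_model_gap(human_acc: dict, model_acc: dict) -> dict:
    """Per-perspective accuracy gap (human minus model), in percentage points."""
    return {view: 100 * (human_acc[view] - model_acc[view]) for view in human_acc}
```

Under this sketch, the reported 40.1% and 26.4% figures correspond to the output of `human_model_gap` for the first-person and third-person splits, respectively.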