🤖 AI Summary
This study investigates whether large multimodal models possess human-like theory of mind (ToM) capabilities—specifically, spatiotemporal reasoning about beliefs, intentions, and emotions in dynamic video scenes. To this end, we propose the first end-to-end video-to-text ToM reasoning framework, introducing a novel keyframe retrieval mechanism to explicitly expose the model’s internal reasoning trajectory. We further construct a video-centric ToM benchmark and a probe-based evaluation methodology. Experimental results demonstrate that multimodal large language models exhibit emergent video-based ToM capabilities: they substantially outperform text-only baselines on social-emotional reasoning (+23.6% accuracy) and yield highly interpretable, stepwise reasoning traces grounded in visual evidence. Our work establishes a new paradigm for multimodal cognitive modeling and provides empirical foundations for developing trustworthy, socially aware AI systems.
📝 Abstract
Do large multimodal models possess a human-like capacity for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that pose questions about the actors' mental states (e.g., beliefs, desires, intentions). However, human reasoning in the wild is often grounded in dynamic scenes unfolding over time. We therefore consider video a new medium for examining spatiotemporal ToM reasoning. Specifically, we ask explicit probing questions about videos rich in social and emotional content. We develop a pipeline that enables multimodal LLMs to perform ToM reasoning over video and text. We further make the ToM reasoning explicit by retrieving the key frames needed to answer a ToM question, which reveals how multimodal LLMs reason about ToM.
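The keyframe retrieval step described above can be illustrated with a minimal sketch. The snippet below is a hypothetical implementation, not the paper's actual mechanism: it assumes each frame and the ToM question are embedded into a shared vector space (e.g., by a vision-language encoder such as CLIP) and ranks frames by cosine similarity to the question, returning the top-k as keyframes.

```python
import numpy as np

def retrieve_keyframes(frame_embeddings, question_embedding, top_k=3):
    """Rank frames by cosine similarity to the question embedding and
    return the indices of the top_k most relevant frames, in temporal order.

    Hypothetical sketch: the paper's retrieval mechanism may differ.
    """
    # L2-normalize so the dot product equals cosine similarity.
    frames = frame_embeddings / np.linalg.norm(
        frame_embeddings, axis=1, keepdims=True
    )
    q = question_embedding / np.linalg.norm(question_embedding)
    scores = frames @ q
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist()), scores

# Toy demo with stand-in embeddings; real embeddings would come from a
# pretrained vision-language encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))           # 10 frames, 8-dim embeddings
question = frames[4] + 0.05 * rng.normal(size=8)  # frame 4 is most relevant
indices, scores = retrieve_keyframes(frames, question, top_k=3)
print(indices)  # frame 4 should appear among the retrieved keyframes
```

Returning the retrieved indices alongside the answer is what makes the reasoning trace inspectable: one can check whether the frames the model attends to actually contain the socially relevant moments.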