Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

📅 2024-06-19
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
This study investigates whether large multimodal models possess human-like theory of mind (ToM) capabilities—specifically, spatiotemporal reasoning about beliefs, intentions, and emotions in dynamic video scenes. To this end, we propose the first end-to-end video-to-text ToM reasoning framework, introducing a novel keyframe retrieval mechanism to explicitly expose the model’s internal reasoning trajectory. We further construct a video-centric ToM benchmark and a probe-based evaluation methodology. Experimental results demonstrate that multimodal large language models exhibit emergent video-based ToM capabilities: they substantially outperform text-only baselines on social-emotional reasoning (+23.6% accuracy) and yield highly interpretable, stepwise reasoning traces grounded in visual evidence. Our work establishes a new paradigm for multimodal cognitive modeling and provides empirical foundations for developing trustworthy, socially aware AI systems.

📝 Abstract
Do large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs): LLMs can reason about people's mental states by solving various text-based ToM tasks that pose questions about actors' mental states (e.g., beliefs, desires, intentions). However, human reasoning in the wild is often grounded in dynamic scenes that unfold over time. We therefore consider video a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos rich in social and emotional content. We develop a pipeline that enables multimodal LLMs to perform ToM reasoning over video and text, and we make the reasoning explicit by retrieving the key frames used to answer each ToM question, revealing how multimodal LLMs reason about ToM.
Problem

Research questions and friction points this paper is trying to address.

Examining emotional and social reasoning in multimodal video models
Developing pipeline for theory-of-mind reasoning using video and text
Enabling explicit ToM reasoning through key frame retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal video LLM pipeline for ToM reasoning
Key frame retrieval for explicit mental state analysis
Spatio-temporal reasoning on dynamic social video content
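The paper's code is not reproduced on this page, but the key-frame retrieval idea above can be sketched as a similarity search: embed the ToM question and each video frame in a shared space (e.g., with a CLIP-style encoder), then keep the top-k frames as visual evidence for the LLM. The function below is an illustrative sketch of that ranking step with generic NumPy vectors; the function name and scoring scheme are assumptions, not the authors' implementation.

```python
import numpy as np

def retrieve_key_frames(frame_embeddings, question_embedding, k=3):
    """Rank frames by cosine similarity to a ToM question embedding.

    Hypothetical helper: in the paper's pipeline the retrieved frames
    would be passed to a multimodal LLM as explicit visual evidence.
    Returns (top-k frame indices, their similarity scores).
    """
    frames = np.asarray(frame_embeddings, dtype=float)
    q = np.asarray(question_embedding, dtype=float)
    # Cosine similarity between every frame embedding and the question.
    sims = frames @ q / (np.linalg.norm(frames, axis=1) * np.linalg.norm(q) + 1e-8)
    # Highest-scoring frames first.
    top = np.argsort(-sims)[:k]
    return top.tolist(), sims[top].tolist()
```

Because the selected frame indices are returned alongside their scores, the model's "attention" over the video is inspectable, which is what makes the ToM reasoning trace explicit.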
Authors
Zhawnen Chen — School of Data Science, University of Virginia
Tianchun Wang — The Pennsylvania State University (machine learning)
Yizhou Wang — Northeastern University
Michal Kosinski — Stanford University (Psychology of Artificial Intelligence, Personality, Psychometrics)
Xiang Zhang — The Pennsylvania State University
Yun Fu — Northeastern University
Sheng Li — School of Data Science, University of Virginia