🤖 AI Summary
This study investigates whether large multimodal models possess human-like theory of mind (ToM) capabilities—specifically, spatiotemporal reasoning about beliefs, intentions, and emotions in dynamic video scenes. To this end, we propose the first end-to-end video-to-text ToM reasoning framework, introducing a novel keyframe retrieval mechanism to explicitly expose the model’s internal reasoning trajectory. We further construct a video-centric ToM benchmark and a probe-based evaluation methodology. Experimental results demonstrate that multimodal large language models exhibit emergent video-based ToM capabilities: they substantially outperform text-only baselines on social-emotional reasoning (+23.6% accuracy) and yield highly interpretable, stepwise reasoning traces grounded in visual evidence. Our work establishes a new paradigm for multimodal cognitive modeling and provides empirical foundations for developing trustworthy, socially aware AI systems.
📝 Abstract
Do large multimodal models possess a human-like capacity for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that pose questions about the actors' mental states (e.g., beliefs, desires, intentions). However, human reasoning in the wild is often grounded in dynamic scenes unfolding over time. We therefore consider video a new medium for examining spatiotemporal ToM reasoning. Specifically, we ask explicit probing questions about videos rich in social and emotional content. We develop a pipeline that enables multimodal LLMs to perform ToM reasoning over video and text. We further make the ToM reasoning explicit by retrieving the key frames needed to answer a ToM question, which reveals how multimodal LLMs reason about ToM.
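The keyframe retrieval step described above can be illustrated with a minimal sketch. The snippet below is a hypothetical implementation, not the paper's actual mechanism: it assumes each frame and the ToM question are embedded into a shared vector space (e.g., by a vision-language encoder such as CLIP) and ranks frames by cosine similarity to the question, returning the top-k as keyframes.

```python
import numpy as np

def retrieve_keyframes(frame_embeddings, question_embedding, top_k=3):
    """Rank frames by cosine similarity to the question embedding and
    return the indices of the top_k most relevant frames, in temporal order.

    Hypothetical sketch: the paper's retrieval mechanism may differ.
    """
    # L2-normalize so the dot product equals cosine similarity.
    frames = frame_embeddings / np.linalg.norm(
        frame_embeddings, axis=1, keepdims=True
    )
    q = question_embedding / np.linalg.norm(question_embedding)
    scores = frames @ q
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist()), scores

# Toy demo with stand-in embeddings; real embeddings would come from a
# pretrained vision-language encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))           # 10 frames, 8-dim embeddings
question = frames[4] + 0.05 * rng.normal(size=8)  # frame 4 is most relevant
indices, scores = retrieve_keyframes(frames, question, top_k=3)
print(indices)  # frame 4 should appear among the retrieved keyframes
```

Returning the retrieved indices alongside the answer is what makes the reasoning trace inspectable: one can check whether the frames the model attends to actually contain the socially relevant moments.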