🤖 AI Summary
Current multimodal large language models (MLLMs) lack rigorous evaluation of Theory of Mind (ToM) capabilities, particularly in long-video, socially grounded contexts. Method: We introduce MOMENTS, the first ToM-oriented long-video multimodal benchmark, comprising 2,344 multiple-choice questions grounded in authentic social-scenario short films. It spans seven ToM categories, including belief, intention, and deception, and emphasizes deep integration of visual perception with social reasoning. The methodology combines long-video contextual modeling, realism-driven video design, and structured multiple-choice assessment. Results: Empirical evaluation reveals that while visual input generally improves performance, state-of-the-art MLLMs still fail to robustly fuse multimodal signals for accurate mental-state inference, exposing a critical bottleneck in social intelligence. MOMENTS establishes a scalable, ecologically valid framework for benchmarking and advancing social understanding in multimodal AI.
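The paper does not publish an evaluation harness here, but the protocol it describes (accuracy over multiple-choice questions, broken out by ToM category) is straightforward to score. Below is a minimal sketch; the JSON-lines item fields (`video`, `question`, `options`, `answer`, `category`), the file name, and the `predict` wrapper are all illustrative assumptions, not the released MOMENTS schema.

```python
import json
from collections import defaultdict
from typing import Callable

def evaluate(items_path: str, predict: Callable[[dict], str]) -> dict:
    """Compute overall and per-category multiple-choice accuracy.

    `predict` takes one benchmark item and returns an option letter
    such as "A". Field names used below are assumptions for this
    sketch, not the official MOMENTS data format.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    with open(items_path) as f:
        for line in f:
            if not line.strip():
                continue
            item = json.loads(line)      # one JSON object per line
            cat = item["category"]       # one of the seven ToM categories
            totals[cat] += 1
            if predict(item) == item["answer"]:
                correct[cat] += 1
    report = {cat: correct[cat] / totals[cat] for cat in totals}
    report["overall"] = sum(correct.values()) / sum(totals.values())
    return report

if __name__ == "__main__":
    # Trivial baseline that always answers "A"; a real run would swap in
    # an MLLM call consuming item["video"] and item["question"].
    print(evaluate("moments_questions.jsonl", lambda item: "A"))
```

Per-category accuracy matters here because an aggregate score can mask the paper's central finding: models may do well on some ToM categories while failing to fuse visual and textual cues on others.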
📝 Abstract
Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS comprises 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI's multimodal understanding of human behavior.