🤖 AI Summary
Current multimodal large language models (MLLMs) lack rigorous evaluation of Theory of Mind (ToM) capabilities, particularly in long-video, socially grounded contexts. Method: We introduce MOMENTS, the first ToM-oriented long-video multimodal benchmark, comprising 2,344 multiple-choice questions grounded in authentic social-scenario short films. It spans seven ToM categories, including belief, intention, and deception, and emphasizes deep integration of visual perception with social reasoning. The methodology combines long-video contextual modeling, realism-driven video design, and structured multiple-choice assessment. Results: Empirical evaluation reveals that while visual input generally improves performance, state-of-the-art MLLMs still fail to robustly fuse multimodal signals for accurate mental-state inference, exposing a critical bottleneck in social intelligence. MOMENTS establishes a scalable, ecologically valid framework for benchmarking and advancing social understanding in multimodal AI.
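The paper does not publish an evaluation harness here, but the protocol it describes (accuracy over multiple-choice questions, broken out by ToM category) is straightforward to score. Below is a minimal sketch; the JSON-lines item fields (`video`, `question`, `options`, `answer`, `category`), the file name, and the `predict` wrapper are all illustrative assumptions, not the released MOMENTS schema.

```python
import json
from collections import defaultdict
from typing import Callable

def evaluate(items_path: str, predict: Callable[[dict], str]) -> dict:
    """Compute overall and per-category multiple-choice accuracy.

    `predict` takes one benchmark item and returns an option letter
    such as "A". Field names used below are assumptions for this
    sketch, not the official MOMENTS data format.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    with open(items_path) as f:
        for line in f:
            if not line.strip():
                continue
            item = json.loads(line)      # one JSON object per line
            cat = item["category"]       # one of the seven ToM categories
            totals[cat] += 1
            if predict(item) == item["answer"]:
                correct[cat] += 1
    report = {cat: correct[cat] / totals[cat] for cat in totals}
    report["overall"] = sum(correct.values()) / sum(totals.values())
    return report

if __name__ == "__main__":
    # Trivial baseline that always answers "A"; a real run would swap in
    # an MLLM call consuming item["video"] and item["question"].
    print(evaluate("moments_questions.jsonl", lambda item: "A"))
```

Per-category accuracy matters here because an aggregate score can mask the paper's central finding: models may do well on some ToM categories while failing to fuse visual and textual cues on others.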
📝 Abstract
Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS comprises 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI's multimodal understanding of human behavior.