MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to perform structured inference of complex human mental states from naturalistic behavior, and real-world data adds class imbalance, visual noise, and domain-specific language. To address this, the work introduces MOTOR-Bench, a benchmark built on the MOTOR-dataset of 1,440 multimodal video clips from collaborative learning scenarios, annotated by educational experts according to self-regulated learning theory. The paper also proposes MOTOR-MAS, a multi-agent framework that performs structured reasoning in a zero-shot setting by coordinating agents that infer explicit behaviors, internal cognitions, and emotions. In the reported experiments, MOTOR-MAS outperforms the best single-model baseline by 15.93 Macro-F1 points across the behavior, cognition, and emotion labels, and surpasses general-purpose multi-agent methods by 10.2 points on internal cognition prediction.

📝 Abstract

Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels and lacks structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully designed benchmark built on a real-world dataset, MOTOR-dataset, which contains 1,440 multimodal video clips from collaborative learning scenarios and reflects key real-world data challenges, including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems on MOTOR-Bench in a zero-shot setting. Their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a multi-agent reasoning framework named MOTOR-MAS. It coordinates multiple agents through a structured coordination mechanism to infer explicit behaviors, internal cognitions, and emotions. Experimental results show that MOTOR-MAS outperforms the best single-model baseline by 15.93 Macro-F1 points across the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent baseline by 10.2 points in internal cognition prediction.
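
The results above are reported in Macro-F1, which averages per-class F1 scores with equal weight and is therefore sensitive to rare classes in a naturally imbalanced dataset. The following is a minimal sketch of that metric, not the paper's evaluation code; the label names `on_task` and `off_task` and the toy predictions are hypothetical and chosen only to show how a majority-class predictor can reach high accuracy while scoring poorly on Macro-F1.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced toy data: 8 majority samples, 2 minority samples.
y_true = ["on_task"] * 8 + ["off_task"] * 2
# A trivial predictor that always outputs the majority class.
y_pred = ["on_task"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.4f}")
```

In this toy case the majority class gets F1 = 16/18 and the minority class gets 0, so Macro-F1 is 4/9 even though accuracy is 0.80. This is why a Macro-F1 gain is a more informative signal than accuracy on imbalanced labels.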

Problem

Research questions and friction points this paper is trying to address.

mental state understanding

structured reasoning

multimodal behavior analysis

zero-shot learning

interpersonal interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot mental state understanding

multimodal large language models

multi-agent system