🤖 AI Summary
Existing affective understanding benchmarks typically treat emotions as static states, so they cannot evaluate whether multimodal large language models can model how emotions evolve and transition between states within social contexts. To bridge this gap, the authors propose EmoTrans, a multitask evaluation benchmark focused on emotion dynamics, comprising 1,000 manually annotated video clips across 12 real-world scenarios and over 3,000 structured question-answer pairs. EmoTrans encompasses four progressively challenging tasks: emotion change detection, emotion state identification, emotion transition reasoning, and next emotion prediction. Systematic evaluation of 18 state-of-the-art models reveals that current models perform relatively well on coarse-grained change detection but struggle with fine-grained emotion dynamics modeling and with socially complex settings, especially multi-person scenarios; reasoning-oriented model variants do not consistently yield clear improvements.
📝 Abstract
Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips covering 12 real-world scenarios, and provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs perform comparatively well on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, and reasoning-oriented model variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.