🤖 AI Summary
This work addresses three challenges in multimodal emotion understanding: the scarcity of high-quality annotated data, inconsistent evaluation benchmarks, and limited model reasoning capabilities. It proposes Emotion-LLaMAv2, an end-to-end unified framework with three components: a multi-view spatiotemporal token encoder that operates without face detection, a convolutional attention pre-fusion module that enables local and global feature interactions outside the LLM backbone, and a perception-to-cognition curriculum for instruction-tuning LLaMA2 on emotion-related tasks. The authors also construct MMEVerse, a large-scale, standardized multimodal emotion benchmark that integrates 12 public datasets into a unified instruction format, with 130,000 training clips and 36,000 test clips across 18 evaluation benchmarks. The authors report performance gains in both emotion recognition and free-form emotion reasoning.
📝 Abstract
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and small-scale, low-quality training data. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module enables simultaneous local and global multimodal feature interactions outside the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2-Audio, Qwen2.5-VL, and GPT-4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
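As a rough illustration of the perception-to-cognition curriculum mentioned above, the following Python sketch groups instruction samples so that recognition (perception) samples are scheduled before free-form reasoning (cognition) samples. The record fields (`clip_id`, `task`, `instruction`, `answer`), the two-stage split, the example records, and the function name `build_curriculum` are all assumptions made for illustration; the abstract does not specify the actual MMEVerse schema or the training schedule.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stage order: perception (recognition) first,
# cognition (free-form reasoning) second. Not taken from the paper.
STAGE_ORDER = {"recognition": 0, "reasoning": 1}


@dataclass(frozen=True)
class InstructionSample:
    """One clip in an assumed unified instruction format."""
    clip_id: str
    task: str          # "recognition" or "reasoning" (assumed task names)
    instruction: str
    answer: str


def build_curriculum(samples: List[InstructionSample]) -> List[List[InstructionSample]]:
    """Group samples into ordered stages, preserving input order within a stage."""
    stages: List[List[InstructionSample]] = [[] for _ in STAGE_ORDER]
    for sample in samples:
        if sample.task not in STAGE_ORDER:
            raise ValueError(f"unknown task: {sample.task!r}")
        stages[STAGE_ORDER[sample.task]].append(sample)
    return stages


# Made-up example records, for illustration only.
samples = [
    InstructionSample("clip_001", "recognition",
                      "Which emotion does the speaker express?", "angry"),
    InstructionSample("clip_002", "reasoning",
                      "Explain which cues indicate the speaker's emotion.",
                      "A raised voice and furrowed brow suggest anger."),
    InstructionSample("clip_003", "recognition",
                      "Which emotion does the speaker express?", "happy"),
]

stages = build_curriculum(samples)
for index, stage in enumerate(stages):
    print(f"stage {index}: {[s.clip_id for s in stage]}")
```

In this sketch a trainer would iterate over `stages` in order, fine-tuning on each stage before moving to the next; whether the actual scheme mixes stages or uses more than two is not stated in the abstract.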