Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

📅 2026-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in multimodal emotion understanding, namely the scarcity of high-quality annotated data, inconsistent evaluation benchmarks, and limited model reasoning capabilities, by proposing a unified end-to-end framework. The approach introduces a novel multi-view spatiotemporal token encoder that operates without face detection, a convolutional attention pre-fusion module that enables local-global feature interactions, and a curriculum learning strategy for instruction-tuning LLaMA2 on emotion-related tasks. The authors also construct MMEVerse, the first large-scale standardized multimodal emotion benchmark, which integrates 12 public datasets into a unified format comprising 130,000 training and 36,000 test clips across 18 subtasks. Experimental results demonstrate significant performance gains on both emotion recognition and free-form emotion reasoning tasks.
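The convolutional attention pre-fusion idea summarized above pairs local and global interactions over the multimodal token sequence before it reaches the LLM. Below is a minimal PyTorch sketch of how such a block could be wired; the module name, the depthwise-convolution choice, and all parameter values are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming PyTorch: a depthwise 1-D convolution models local
# interactions along the token axis, while multi-head self-attention models
# global ones, applied to the concatenated audio/visual/text tokens before
# they are handed to the LLM backbone. Names and sizes are hypothetical.
import torch
import torch.nn as nn


class ConvAttentionPreFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        # Depthwise convolution over the token axis captures local neighborhoods.
        self.local_conv = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        # Standard multi-head self-attention captures global token interactions.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -- concatenated multimodal tokens.
        local = self.local_conv(tokens.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(tokens + local)        # local residual branch
        attn_out, _ = self.global_attn(x, x, x)
        return self.norm2(x + attn_out)       # global residual branch


if __name__ == "__main__":
    # Example: 32 audio + 64 visual + 16 text tokens fused before the LLM.
    fused = ConvAttentionPreFusion()(torch.randn(2, 112, 768))
    print(fused.shape)  # torch.Size([2, 112, 768])
```

The depthwise convolution keeps the local branch cheap, and the residual layout mirrors a standard Transformer block, so a module of this shape could sit in front of an LLM projector without changing its interface.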

📝 Abstract
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions outside the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2-Audio, Qwen2.5-VL, and GPT-4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
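The abstract states that twelve source datasets are re-annotated into a unified multimodal instruction format. As a rough illustration, one converted record might resemble the sketch below; every field name, path, and annotation string is hypothetical and only meant to show how a categorical recognition label and a free-form reasoning response could coexist in a single schema.

```python
# Hypothetical MMEVerse-style instruction record (field names and content are
# illustrative assumptions; the actual schema is defined by the benchmark).
sample = {
    "source_dataset": "MELD",                  # one of the twelve aggregated datasets
    "clip_id": "meld_dia0_utt3",               # hypothetical identifier
    "video": "clips/meld_dia0_utt3.mp4",
    "audio": "clips/meld_dia0_utt3.wav",
    "transcript": "I can't believe you did that!",
    "task": "emotion_reasoning",               # one of the evaluation subtasks
    "instruction": "Describe the speaker's emotional state and explain which "
                   "audio, visual, and textual cues support your answer.",
    "response": "The speaker sounds surprised and slightly angry: the pitch "
                "rises sharply, the eyebrows are raised, and the wording is "
                "accusatory.",
    "label": "surprise",                       # categorical label for recognition tasks
}
```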
Problem

Research questions and friction points this paper is trying to address.

multimodal emotion understanding
emotion recognition
affective computing
large language models
emotion benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal emotion understanding
end-to-end multiview encoder
Conv Attention pre-fusion
curriculum instruction tuning
MMEVerse benchmark