Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video-based affective analysis models struggle to jointly model fine-grained facial micro-expressions and speech, a problem compounded by the scarcity of high-quality multimodal emotion datasets. To address this, we propose Omni-Emotion, a multimodal affective analysis framework built around fine-grained alignment of facial micro-expressions and speech. Our approach introduces a two-tier annotated dataset (24,137 coarse-grained + 3,500 fine-grained samples), combining self-reviewed and human-reviewed annotations. We design a facial micro-expression encoder and a temporal audio modeling module, enabling fine-grained cross-modal alignment within a unified representation space. Additionally, we employ instruction-tuned joint optimization to simultaneously strengthen emotion recognition and reasoning capabilities. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements in discriminative accuracy and interpretability for subtle emotions, particularly contempt and confusion.
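
To make the fusion idea concrete, below is a minimal sketch of how features from a facial encoder and an audio encoder could be projected into a video MLLM's token space to form the unified representation the summary describes. All module names, dimensions, and the simple linear-projection design are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumption, not the paper's code): per-modality linear
# projections map facial and audio features into the MLLM's embedding space,
# so the LLM can attend over all modalities in one unified sequence.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, face_dim=512, audio_dim=1024, llm_dim=4096):
        super().__init__()
        # One projection per modality into the shared LLM token space.
        self.face_proj = nn.Linear(face_dim, llm_dim)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, face_feats, audio_feats, video_tokens):
        # face_feats:   (batch, n_frames,  face_dim)  from a facial encoder
        # audio_feats:  (batch, n_windows, audio_dim) from an audio encoder
        # video_tokens: (batch, n_tokens,  llm_dim)   from the video MLLM
        face_tokens = self.face_proj(face_feats)
        audio_tokens = self.audio_proj(audio_feats)
        # Concatenate along the sequence axis; downstream self-attention
        # then mixes video, facial, and audio tokens jointly.
        return torch.cat([video_tokens, face_tokens, audio_tokens], dim=1)

proj = MultimodalProjector()
fused = proj(torch.randn(1, 16, 512),    # 16 face frames
             torch.randn(1, 50, 1024),   # 50 audio windows
             torch.randn(1, 256, 4096))  # 256 video tokens
print(fused.shape)  # torch.Size([1, 322, 4096])
```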

📝 Abstract
Understanding emotions accurately is essential for fields like human-computer interaction. Because emotions are complex and inherently multimodal (they are shaped by facial expressions and audio, among other cues), researchers have turned to multimodal models rather than single-modality ones to understand human emotions. However, current video multimodal large language models (MLLMs) have difficulty effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion-analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and generalize better to real-world applications. Moreover, in addition to modeling audio, we propose to explicitly integrate facial encoding models into an existing advanced video MLLM, enabling it to effectively unify audio and subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning on our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
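
The instruction tuning mentioned above pairs each annotated clip with both a recognition target and a reasoning target. A hypothetical record in a common instruction-tuning layout might look like the following; every field name and value here is an illustrative assumption rather than the paper's actual schema.

```python
# Hypothetical instruction-tuning record (illustrative only): one sample
# carries an emotion label for recognition and a free-form rationale for
# reasoning, so both capabilities are optimized jointly.
sample = {
    "video": "clip_00123.mp4",
    "audio": "clip_00123.wav",
    "instruction": "Describe the speaker's emotion and explain the facial "
                   "and vocal cues that support your answer.",
    "label": "contempt",
    "reasoning": "A unilateral lip raise combined with flat, clipped "
                 "prosody suggests contempt rather than anger.",
}
```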
Problem

Research questions and friction points this paper is trying to address.

Multimodal Emotion Analysis
Facial Micro-Expressions
Voice Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality Datasets
Integrated Facial Expression Analysis
Enhanced Emotion Understanding
Qize Yang
Tongyi Lab, Alibaba Group
Computer Vision · Deep Learning
Detao Bai
Tongyi Lab, Alibaba Group
Yi-Xing Peng
Sun Yat-sen University
Xihan Wei
Tongyi Lab, Alibaba Group