Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video-based affective analysis models struggle to jointly model fine-grained facial micro-expressions and speech, a problem compounded by the scarcity of high-quality multimodal emotion datasets. To address this, we propose Omni-Emotion, a multimodal affective analysis framework built around fine-grained alignment of facial micro-expressions and speech. Our approach introduces a two-tier annotated dataset (24,137 coarse-grained + 3,500 fine-grained samples), combining self-reviewed and human-reviewed annotations. We design a facial micro-expression encoder and a temporal audio modeling module, enabling fine-grained cross-modal alignment within a unified representation space. Additionally, we employ instruction-tuned joint optimization to simultaneously strengthen emotion recognition and reasoning capabilities. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements in discriminative accuracy and interpretability for subtle emotions, particularly contempt and confusion.
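
To make the fusion idea concrete, below is a minimal sketch of how features from a facial encoder and an audio encoder could be projected into a video MLLM's token space to form the unified representation the summary describes. All module names, dimensions, and the simple linear-projection design are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumption, not the paper's code): per-modality linear
# projections map facial and audio features into the MLLM's embedding space,
# so the LLM can attend over all modalities in one unified sequence.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, face_dim=512, audio_dim=1024, llm_dim=4096):
        super().__init__()
        # One projection per modality into the shared LLM token space.
        self.face_proj = nn.Linear(face_dim, llm_dim)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, face_feats, audio_feats, video_tokens):
        # face_feats:   (batch, n_frames,  face_dim)  from a facial encoder
        # audio_feats:  (batch, n_windows, audio_dim) from an audio encoder
        # video_tokens: (batch, n_tokens,  llm_dim)   from the video MLLM
        face_tokens = self.face_proj(face_feats)
        audio_tokens = self.audio_proj(audio_feats)
        # Concatenate along the sequence axis; downstream self-attention
        # then mixes video, facial, and audio tokens jointly.
        return torch.cat([video_tokens, face_tokens, audio_tokens], dim=1)

proj = MultimodalProjector()
fused = proj(torch.randn(1, 16, 512),    # 16 face frames
             torch.randn(1, 50, 1024),   # 50 audio windows
             torch.randn(1, 256, 4096))  # 256 video tokens
print(fused.shape)  # torch.Size([1, 322, 4096])
```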

📝 Abstract
Understanding emotions accurately is essential for fields like human-computer interaction. Because emotions are complex and inherently multimodal (they are shaped by facial expressions and audio, among other cues), researchers have turned to multimodal models rather than single-modality ones to understand human emotions. However, current video multimodal large language models (MLLMs) have difficulty effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion-analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and generalize better to real-world applications. Moreover, in addition to modeling audio, we propose to explicitly integrate facial encoding models into an existing advanced video MLLM, enabling it to effectively unify audio and subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning on our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
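
The instruction tuning mentioned above pairs each annotated clip with both a recognition target and a reasoning target. A hypothetical record in a common instruction-tuning layout might look like the following; every field name and value here is an illustrative assumption rather than the paper's actual schema.

```python
# Hypothetical instruction-tuning record (illustrative only): one sample
# carries an emotion label for recognition and a free-form rationale for
# reasoning, so both capabilities are optimized jointly.
sample = {
    "video": "clip_00123.mp4",
    "audio": "clip_00123.wav",
    "instruction": "Describe the speaker's emotion and explain the facial "
                   "and vocal cues that support your answer.",
    "label": "contempt",
    "reasoning": "A unilateral lip raise combined with flat, clipped "
                 "prosody suggests contempt rather than anger.",
}
```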
Problem

Research questions and friction points this paper is trying to address.

Multimodal Emotion Analysis
Facial Micro-Expressions
Voice Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality Datasets
Integrated Facial Expression Analysis
Enhanced Emotion Understanding
Qize Yang
Tongyi Lab, Alibaba Group
Computer Vision · Deep Learning
Detao Bai
Tongyi Lab, Alibaba Group
Yi-Xing Peng
Sun Yat-sen University
Xihan Wei
Tongyi Lab, Alibaba Group