MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses continuous emotion recognition in natural dynamic scenes, grounded in the two-dimensional Valence-Arousal (V-A) psychological model and leveraging multimodal (visual, audio, textual) inputs. The proposed method introduces three key innovations: (1) a novel six-way bidirectional cross-modal attention mechanism enabling fine-grained inter-modal alignment; (2) a polar-coordinate representation to jointly encode the periodicity and correlation inherent in the V-A space; and (3) a two-stage feature optimization strategy integrating cross-modal enhancement and intra-modal self-attention. Evaluated on the Aff-Wild2 benchmark using Concordance Correlation Coefficient (CCC) as the primary metric, the approach achieves state-of-the-art performance, demonstrating substantial improvements in both accuracy and robustness for emotion modeling in complex conversational videos.
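To make the polar-coordinate idea concrete, below is a minimal NumPy sketch of the coordinate change the summary describes: radius as emotional intensity, angle as position around the circumplex. The function names and exact parameterization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def va_to_polar(valence, arousal):
    """Map Cartesian valence-arousal to (intensity, angle).

    The radius captures emotional intensity (distance from the neutral
    origin of the circumplex); the angle captures position around the
    circumplex and is periodic in 2*pi, a property Cartesian targets
    cannot express directly.
    """
    r = np.hypot(valence, arousal)        # intensity
    theta = np.arctan2(arousal, valence)  # angle in (-pi, pi]
    return r, theta

def polar_to_va(r, theta):
    """Invert the mapping for evaluation in V-A space."""
    return r * np.cos(theta), r * np.sin(theta)

# Round trip on a small batch of annotations in [-1, 1]^2
v, a = np.array([0.3, -0.5]), np.array([0.8, 0.2])
r, th = va_to_polar(v, a)
v2, a2 = polar_to_va(r, th)
assert np.allclose(v, v2) and np.allclose(a, a2)
```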

📝 Abstract
This paper introduces MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model uniquely integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism with six distinct attention pathways, enabling comprehensive interactions between all modality pairs. Our proposed approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts. The architecture's novelty lies in its cross-modal enhancement strategy, where each modality representation is refined through weighted attention from other modalities, followed by self-attention refinement through modality-specific encoders. Rather than directly predicting valence-arousal values, MAVEN predicts emotions in a polar coordinate form, aligning with psychological models of the emotion circumplex. Experimental evaluation on the Aff-Wild2 dataset demonstrates the effectiveness of our approach, with performance measured using Concordance Correlation Coefficient (CCC). The multi-stage architecture demonstrates superior ability to capture the complex, nuanced nature of emotional expressions in conversational videos, advancing the state-of-the-art (SOTA) in continuous emotion recognition in-the-wild. Code can be found at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW.
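The evaluation metric named in the abstract is standard; for reference, a minimal NumPy implementation of the Concordance Correlation Coefficient:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)

    Ranges in [-1, 1]; 1 requires agreement in correlation, scale,
    and location, which is why CCC is preferred over plain Pearson
    correlation for continuous valence-arousal prediction.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = ((y_true - mx) * (y_pred - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```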
Problem

Research questions and friction points this paper is trying to address.

Recognizing continuously evolving emotion from integrated multi-modal (visual, audio, textual) data
Designing cross-modal attention that extracts richer emotion features than single-modality pipelines
Advancing continuous, in-the-wild emotion recognition in conversational videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Six-pathway bi-directional cross-modal attention mechanism (see the sketch after this list)
Modality-specific encoders for visual, audio, and textual feature extraction
Polar-coordinate emotion prediction aligned with the circumplex model of affect
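A minimal PyTorch sketch of what six directed attention pathways over three modalities can look like. The class name, the residual connection, and fusion by averaging the incoming messages are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SixWayCrossModalAttention(nn.Module):
    """Six directed cross-attention pathways over three modalities.

    The three modalities (visual, audio, text) form six ordered pairs;
    each pathway lets one modality attend to another, and each modality's
    enhanced representation aggregates its two incoming messages.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        mods = ("visual", "audio", "text")
        self.pairs = [(q, k) for q in mods for k in mods if q != k]
        self.attn = nn.ModuleDict({
            f"{q}_from_{k}": nn.MultiheadAttention(dim, heads, batch_first=True)
            for q, k in self.pairs
        })

    def forward(self, feats):
        # feats: dict of modality name -> (batch, seq_len, dim) tensor
        msgs = {m: [] for m in feats}
        for q, k in self.pairs:
            out, _ = self.attn[f"{q}_from_{k}"](feats[q], feats[k], feats[k])
            msgs[q].append(out)
        # residual add of the averaged incoming cross-modal messages
        return {m: feats[m] + torch.stack(msgs[m]).mean(0) for m in feats}

feats = {m: torch.randn(2, 50, 256) for m in ("visual", "audio", "text")}
enhanced = SixWayCrossModalAttention()(feats)  # same shapes, cross-enhanced
```

Per-modality self-attention encoders, as in the paper's two-stage strategy, would then refine each enhanced representation before prediction.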
👥 Authors

Vrushank Ahire
B.Tech Undergraduate, Indian Institute of Technology Ropar
Deep Learning · Affective Computing · Machine Learning · ASR

Kunal Shah
Stanford University
Robotics

Mudasir Nazir Khan
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Nikhil Pakhale
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Lownish Rai Sookha
PhD Student, IIT Ropar

M. A. Ganaie
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Abhinav Dhall
Associate Professor, Monash University
Affective Computing · Computer Vision · Human-Centered AI