MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses continuous emotion recognition in natural dynamic scenes, grounded in the two-dimensional Valence-Arousal (V-A) psychological model and leveraging multimodal (visual, audio, textual) inputs. The proposed method introduces three key innovations: (1) a novel six-way bidirectional cross-modal attention mechanism enabling fine-grained inter-modal alignment; (2) a polar-coordinate representation to jointly encode the periodicity and correlation inherent in the V-A space; and (3) a two-stage feature optimization strategy integrating cross-modal enhancement and intra-modal self-attention. Evaluated on the Aff-Wild2 benchmark using Concordance Correlation Coefficient (CCC) as the primary metric, the approach achieves state-of-the-art performance, demonstrating substantial improvements in both accuracy and robustness for emotion modeling in complex conversational videos.
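To make the polar-coordinate idea concrete, below is a minimal NumPy sketch of the coordinate change the summary describes: radius as emotional intensity, angle as position around the circumplex. The function names and exact parameterization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def va_to_polar(valence, arousal):
    """Map Cartesian valence-arousal to (intensity, angle).

    The radius captures emotional intensity (distance from the neutral
    origin of the circumplex); the angle captures position around the
    circumplex and is periodic in 2*pi, a property Cartesian targets
    cannot express directly.
    """
    r = np.hypot(valence, arousal)        # intensity
    theta = np.arctan2(arousal, valence)  # angle in (-pi, pi]
    return r, theta

def polar_to_va(r, theta):
    """Invert the mapping for evaluation in V-A space."""
    return r * np.cos(theta), r * np.sin(theta)

# Round trip on a small batch of annotations in [-1, 1]^2
v, a = np.array([0.3, -0.5]), np.array([0.8, 0.2])
r, th = va_to_polar(v, a)
v2, a2 = polar_to_va(r, th)
assert np.allclose(v, v2) and np.allclose(a, a2)
```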

📝 Abstract
This paper introduces MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model uniquely integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism with six distinct attention pathways, enabling comprehensive interactions between all modality pairs. Our proposed approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts. The architecture's novelty lies in its cross-modal enhancement strategy, where each modality representation is refined through weighted attention from other modalities, followed by self-attention refinement through modality-specific encoders. Rather than directly predicting valence-arousal values, MAVEN predicts emotions in a polar coordinate form, aligning with psychological models of the emotion circumplex. Experimental evaluation on the Aff-Wild2 dataset demonstrates the effectiveness of our approach, with performance measured using Concordance Correlation Coefficient (CCC). The multi-stage architecture demonstrates superior ability to capture the complex, nuanced nature of emotional expressions in conversational videos, advancing the state-of-the-art (SOTA) in continuous emotion recognition in-the-wild. Code can be found at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW.
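The evaluation metric named in the abstract is standard; for reference, a minimal NumPy implementation of the Concordance Correlation Coefficient:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)

    Ranges in [-1, 1]; 1 requires agreement in correlation, scale,
    and location, which is why CCC is preferred over plain Pearson
    correlation for continuous valence-arousal prediction.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = ((y_true - mx) * (y_pred - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```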
Problem

Research questions and friction points this paper is trying to address.

Recognizing continuously evolving emotion from integrated multi-modal (visual, audio, textual) data
Designing cross-modal attention that extracts richer emotion features than single-modality pipelines
Advancing continuous, in-the-wild emotion recognition in conversational videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Six-pathway bi-directional cross-modal attention mechanism (see the sketch after this list)
Modality-specific encoders for visual, audio, and textual feature extraction
Polar-coordinate emotion prediction aligned with the circumplex model of affect
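A minimal PyTorch sketch of what six directed attention pathways over three modalities can look like. The class name, the residual connection, and fusion by averaging the incoming messages are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SixWayCrossModalAttention(nn.Module):
    """Six directed cross-attention pathways over three modalities.

    The three modalities (visual, audio, text) form six ordered pairs;
    each pathway lets one modality attend to another, and each modality's
    enhanced representation aggregates its two incoming messages.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        mods = ("visual", "audio", "text")
        self.pairs = [(q, k) for q in mods for k in mods if q != k]
        self.attn = nn.ModuleDict({
            f"{q}_from_{k}": nn.MultiheadAttention(dim, heads, batch_first=True)
            for q, k in self.pairs
        })

    def forward(self, feats):
        # feats: dict of modality name -> (batch, seq_len, dim) tensor
        msgs = {m: [] for m in feats}
        for q, k in self.pairs:
            out, _ = self.attn[f"{q}_from_{k}"](feats[q], feats[k], feats[k])
            msgs[q].append(out)
        # residual add of the averaged incoming cross-modal messages
        return {m: feats[m] + torch.stack(msgs[m]).mean(0) for m in feats}

feats = {m: torch.randn(2, 50, 256) for m in ("visual", "audio", "text")}
enhanced = SixWayCrossModalAttention()(feats)  # same shapes, cross-enhanced
```

Per-modality self-attention encoders, as in the paper's two-stage strategy, would then refine each enhanced representation before prediction.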
👥 Authors

Vrushank Ahire
B.Tech Undergraduate, Indian Institute of Technology Ropar
Deep Learning · Affective Computing · Machine Learning · ASR

Kunal Shah
Stanford University
Robotics

Mudasir Nazir Khan
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Nikhil Pakhale
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Lownish Rai Sookha
PhD Student, IIT Ropar

M. A. Ganaie
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India

Abhinav Dhall
Associate Professor, Monash University
Affective Computing · Computer Vision · Human-Centered AI