🤖 AI Summary
To address weak cross-modal interaction and imbalanced modality contributions in multimodal emotion recognition (MER), this paper proposes Sync-TVA, a dynamic-enhancement and heterogeneous-graph co-modeling framework. Methodologically, it introduces: (1) a modality-specific dynamic feature enhancement module that adaptively calibrates the representation strength of each modality; (2) a heterogeneous cross-modal graph structure that explicitly encodes asymmetric semantic relationships among the text, audio, and visual modalities; and (3) a cross-attention fusion mechanism that improves fine-grained semantic alignment and emotion reasoning. The framework is trained end to end and incorporates strategies to mitigate class imbalance. Extensive experiments on MELD and IEMOCAP demonstrate significant improvements over state-of-the-art methods in both accuracy and weighted F1 score, validating the effectiveness and robustness of the proposed cross-modal co-modeling approach.
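The summary describes the dynamic enhancement module only at a high level. As a rough illustration of what "adaptively calibrating representation strength per modality" could mean, here is a minimal PyTorch sketch built around a learned sigmoid gate with a residual connection; the class name, dimensions, and gating design are assumptions for exposition, not the paper's actual code:

```python
import torch
import torch.nn as nn

class DynamicEnhancement(nn.Module):
    """Hypothetical per-modality dynamic enhancement: a learned sigmoid
    gate rescales each feature dimension, adaptively strengthening or
    suppressing that modality's representation (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Sigmoid(),  # per-dimension gate values in (0, 1)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) features for a single modality
        return self.norm(x + self.gate(x) * x)  # gated residual enhancement

# One independent enhancement module per modality, as the summary suggests
enhancers = nn.ModuleDict(
    {m: DynamicEnhancement(256) for m in ("text", "audio", "visual")}
)
```

A gate of this kind lets gradient signal rebalance modality contributions during training, which is one plausible way to address the imbalance the summary mentions.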
📝 Abstract
Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
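For a concrete picture of the cross-attention fusion step, the sketch below shows standard multi-head cross-attention between two modalities, with residual connections; the heterogeneous graph stage is omitted. The module names, dimensions, and the text→audio→visual pairing order are illustrative assumptions, not Sync-TVA's published implementation:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-attention fusion: one modality queries another
    to align fine-grained semantic cues (not the authors' exact design)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends over key_mod; shapes are (batch, seq_len, dim),
        # and the two sequence lengths may differ
        fused, _ = self.attn(query_mod, key_mod, key_mod)
        return self.norm(query_mod + fused)  # residual connection

# Example: text queries audio, then the fused result queries visual
text_audio = CrossModalAttention(256)
ta_visual = CrossModalAttention(256)
text = torch.randn(8, 20, 256)    # dummy text features
audio = torch.randn(8, 50, 256)   # dummy audio features
visual = torch.randn(8, 30, 256)  # dummy visual features
fused = ta_visual(text_audio(text, audio), visual)  # (8, 20, 256)
```

In this kind of design, each attention hop lets the querying modality pull in aligned cues from another modality before the final emotion classifier is applied.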