🤖 AI Summary
To address weak cross-modal interaction and imbalanced modality contributions in multimodal emotion recognition (MER), this paper proposes Sync-TVA, a dynamic-enhancement and heterogeneous-graph co-modeling framework. Methodologically, it introduces: (1) a modality-specific dynamic feature enhancement module that adaptively calibrates the representation strength of each modality; (2) a heterogeneous cross-modal graph structure that explicitly encodes asymmetric semantic relationships among the text, audio, and visual modalities; and (3) a cross-attention fusion mechanism that improves fine-grained semantic alignment and emotion reasoning. The framework is trained end to end and incorporates strategies to mitigate class imbalance. Extensive experiments on MELD and IEMOCAP demonstrate significant improvements over state-of-the-art methods in both accuracy and weighted F1 score, validating the effectiveness and robustness of the proposed cross-modal co-modeling approach.
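The summary describes the dynamic enhancement module only at a high level. As a rough illustration of what "adaptively calibrating representation strength per modality" could mean, here is a minimal PyTorch sketch built around a learned sigmoid gate with a residual connection; the class name, dimensions, and gating design are assumptions for exposition, not the paper's actual code:

```python
import torch
import torch.nn as nn

class DynamicEnhancement(nn.Module):
    """Hypothetical per-modality dynamic enhancement: a learned sigmoid
    gate rescales each feature dimension, adaptively strengthening or
    suppressing that modality's representation (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Sigmoid(),  # per-dimension gate values in (0, 1)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) features for a single modality
        return self.norm(x + self.gate(x) * x)  # gated residual enhancement

# One independent enhancement module per modality, as the summary suggests
enhancers = nn.ModuleDict(
    {m: DynamicEnhancement(256) for m in ("text", "audio", "visual")}
)
```

A gate of this kind lets gradient signal rebalance modality contributions during training, which is one plausible way to address the imbalance the summary mentions.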
📝 Abstract
Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
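For a concrete picture of the cross-attention fusion step, the sketch below shows standard multi-head cross-attention between two modalities, with residual connections; the heterogeneous graph stage is omitted. The module names, dimensions, and the text→audio→visual pairing order are illustrative assumptions, not Sync-TVA's published implementation:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-attention fusion: one modality queries another
    to align fine-grained semantic cues (not the authors' exact design)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends over key_mod; shapes are (batch, seq_len, dim),
        # and the two sequence lengths may differ
        fused, _ = self.attn(query_mod, key_mod, key_mod)
        return self.norm(query_mod + fused)  # residual connection

# Example: text queries audio, then the fused result queries visual
text_audio = CrossModalAttention(256)
ta_visual = CrossModalAttention(256)
text = torch.randn(8, 20, 256)    # dummy text features
audio = torch.randn(8, 50, 256)   # dummy audio features
visual = torch.randn(8, 30, 256)  # dummy visual features
fused = ta_visual(text_audio(text, audio), visual)  # (8, 20, 256)
```

In this kind of design, each attention hop lets the querying modality pull in aligned cues from another modality before the final emotion classifier is applied.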