🤖 AI Summary
To address two key challenges in multimodal emotion recognition (MER), namely weak modality-specific feature representation and the difficulty of modeling cross-modal semantic similarity under modality heterogeneity, this paper proposes a framework that integrates gated inter-modal attention with modality-invariant representation learning. A gated mechanism dynamically models pairwise emotional interactions among modalities to strengthen modality-specific feature extraction, while a modality-invariant generator aligns cross-modal semantic distributions under adversarial domain alignment constraints. Evaluated on IEMOCAP, the method achieves 80.7% weighted accuracy and 81.3% unweighted accuracy, outperforming state-of-the-art approaches. The core contributions are: (1) an interpretable gated mechanism for modeling inter-modal emotional interactions; (2) an explicit strategy for cross-modal distribution alignment via adversarial learning; and (3) an end-to-end trainable multimodal fusion architecture.
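The paper's exact formulation of the gated inter-modal attention is not reproduced here, so the sketch below is only a minimal PyTorch-style illustration of the general idea: one modality attends to a second modality, and a learned sigmoid gate decides how much of the attended cross-modal context is mixed into the target modality's own features. The class name `GatedInterModalAttention`, the use of `nn.MultiheadAttention`, and all tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class GatedInterModalAttention(nn.Module):
    """Illustrative pairwise gated cross-attention between two modalities.

    The target modality attends to the source modality; a sigmoid gate
    (hypothetical formulation) controls how much cross-modal context is
    blended into the target's modality-specific representation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate is computed from the concatenation [target; attended source].
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_target, dim), source: (batch, T_source, dim)
        attended, _ = self.cross_attn(query=target, key=source, value=source)
        g = torch.sigmoid(self.gate(torch.cat([target, attended], dim=-1)))
        # Gated mixture of cross-modal context and the original target features.
        return g * attended + (1.0 - g) * target


# Usage sketch: text features attending to speech context (shapes are made up).
text = torch.randn(8, 20, 256)    # (batch, text tokens, dim)
speech = torch.randn(8, 50, 256)  # (batch, speech frames, dim)
fused = GatedInterModalAttention(256)(text, speech)
print(fused.shape)  # torch.Size([8, 20, 256])
```

In a pairwise scheme like the one described above, a block of this kind would be applied to each ordered modality pair (text-speech, speech-video, etc.), with the gate keeping each output anchored to its own modality.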
📝 Abstract
Multimodal emotion recognition (MER) infers emotions from multimodal data, including visual, speech, and text inputs, and plays a key role in human-computer interaction. Attention-based fusion methods dominate MER research and achieve strong classification performance. However, two key challenges remain: effectively extracting modality-specific features and capturing cross-modal similarities despite the distribution differences caused by modality heterogeneity. To address these, we propose a gated interactive attention mechanism that adaptively extracts modality-specific features while enhancing emotional information through pairwise interactions. Additionally, we introduce a modality-invariant generator to learn modality-invariant representations and constrain domain shift by aligning cross-modal similarities. Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving a weighted accuracy (WA) of 80.7% and an unweighted accuracy (UA) of 81.3%.
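The abstract does not spell out how the modality-invariant generator is constrained adversarially, so the following is a minimal sketch assuming a standard gradient-reversal setup, a common way to realize adversarial domain alignment: a shared generator maps each modality's features into a common space, and a modality discriminator trained through a gradient-reversal layer pushes those shared features to become indistinguishable across modalities. The names `ModalityInvariantGenerator`, `GradReverse`, the layer sizes, and the `lambd` coefficient are all hypothetical.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ModalityInvariantGenerator(nn.Module):
    """Hypothetical generator/discriminator pair for modality-invariant features.

    The generator projects per-modality features into a shared space; the
    discriminator tries to identify the source modality, and the reversed
    gradient trains the generator to remove modality-specific cues.
    """

    def __init__(self, dim: int, num_modalities: int = 3, lambd: float = 1.0):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.discriminator = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, num_modalities)
        )
        self.lambd = lambd

    def forward(self, feats: torch.Tensor):
        shared = self.generator(feats)                                # modality-invariant features
        logits = self.discriminator(GradReverse.apply(shared, self.lambd))
        return shared, logits                                         # logits feed a modality-ID loss


# Usage sketch: the adversarial loss is cross-entropy over modality labels
# (e.g., 0 = text, 1 = speech, 2 = visual); values here are illustrative.
feats = torch.randn(8, 256)
modality_labels = torch.randint(0, 3, (8,))
shared, logits = ModalityInvariantGenerator(256)(feats, )
adv_loss = nn.CrossEntropyLoss()(logits, modality_labels)
```

Minimizing this adversarial loss jointly with the emotion classification objective is one plausible way to align cross-modal distributions end to end, consistent with the framework described above.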