🤖 AI Summary
To address two key challenges in multimodal emotion recognition (MER), namely weak modality-specific feature representation and the difficulty of modeling cross-modal semantic similarity under modality heterogeneity, this paper proposes a framework that integrates gated inter-modal attention with modality-invariant representation learning. A gated mechanism dynamically models pairwise emotional interactions among modalities to strengthen modality-specific feature extraction, while a modality-invariant generator aligns cross-modal semantic distributions under adversarial domain alignment constraints. Evaluated on IEMOCAP, the method achieves 80.7% weighted accuracy and 81.3% unweighted accuracy, outperforming state-of-the-art approaches. The core contributions are: (1) an interpretable gated mechanism for modeling inter-modal emotional interactions; (2) an explicit strategy for cross-modal distribution alignment via adversarial learning; and (3) an end-to-end trainable multimodal fusion architecture.
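The paper's exact formulation of the gated inter-modal attention is not reproduced here, so the sketch below is only a minimal PyTorch-style illustration of the general idea: one modality attends to a second modality, and a learned sigmoid gate decides how much of the attended cross-modal context is mixed into the target modality's own features. The class name `GatedInterModalAttention`, the use of `nn.MultiheadAttention`, and all tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class GatedInterModalAttention(nn.Module):
    """Illustrative pairwise gated cross-attention between two modalities.

    The target modality attends to the source modality; a sigmoid gate
    (hypothetical formulation) controls how much cross-modal context is
    blended into the target's modality-specific representation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate is computed from the concatenation [target; attended source].
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_target, dim), source: (batch, T_source, dim)
        attended, _ = self.cross_attn(query=target, key=source, value=source)
        g = torch.sigmoid(self.gate(torch.cat([target, attended], dim=-1)))
        # Gated mixture of cross-modal context and the original target features.
        return g * attended + (1.0 - g) * target


# Usage sketch: text features attending to speech context (shapes are made up).
text = torch.randn(8, 20, 256)    # (batch, text tokens, dim)
speech = torch.randn(8, 50, 256)  # (batch, speech frames, dim)
fused = GatedInterModalAttention(256)(text, speech)
print(fused.shape)  # torch.Size([8, 20, 256])
```

In a pairwise scheme like the one described above, a block of this kind would be applied to each ordered modality pair (text-speech, speech-video, etc.), with the gate keeping each output anchored to its own modality.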
📝 Abstract
Multimodal emotion recognition (MER) infers emotions from multimodal data, including visual, speech, and text inputs, and plays a key role in human-computer interaction. Attention-based fusion methods dominate MER research and achieve strong classification performance. However, two key challenges remain: effectively extracting modality-specific features and capturing cross-modal similarities despite the distribution differences caused by modality heterogeneity. To address these, we propose a gated interactive attention mechanism that adaptively extracts modality-specific features while enhancing emotional information through pairwise interactions. Additionally, we introduce a modality-invariant generator to learn modality-invariant representations and constrain domain shift by aligning cross-modal similarities. Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving a weighted accuracy (WA) of 80.7% and an unweighted accuracy (UA) of 81.3%.
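The abstract does not spell out how the modality-invariant generator is constrained adversarially, so the following is a minimal sketch assuming a standard gradient-reversal setup, a common way to realize adversarial domain alignment: a shared generator maps each modality's features into a common space, and a modality discriminator trained through a gradient-reversal layer pushes those shared features to become indistinguishable across modalities. The names `ModalityInvariantGenerator`, `GradReverse`, the layer sizes, and the `lambd` coefficient are all hypothetical.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ModalityInvariantGenerator(nn.Module):
    """Hypothetical generator/discriminator pair for modality-invariant features.

    The generator projects per-modality features into a shared space; the
    discriminator tries to identify the source modality, and the reversed
    gradient trains the generator to remove modality-specific cues.
    """

    def __init__(self, dim: int, num_modalities: int = 3, lambd: float = 1.0):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.discriminator = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, num_modalities)
        )
        self.lambd = lambd

    def forward(self, feats: torch.Tensor):
        shared = self.generator(feats)                                # modality-invariant features
        logits = self.discriminator(GradReverse.apply(shared, self.lambd))
        return shared, logits                                         # logits feed a modality-ID loss


# Usage sketch: the adversarial loss is cross-entropy over modality labels
# (e.g., 0 = text, 1 = speech, 2 = visual); values here are illustrative.
feats = torch.randn(8, 256)
modality_labels = torch.randint(0, 3, (8,))
shared, logits = ModalityInvariantGenerator(256)(feats, )
adv_loss = nn.CrossEntropyLoss()(logits, modality_labels)
```

Minimizing this adversarial loss jointly with the emotion classification objective is one plausible way to align cross-modal distributions end to end, consistent with the framework described above.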