Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

📅 2025-11-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address poor data quality and low-frequency emotion class recognition in Multimodal Emotion Recognition in Conversations (MERC), this paper proposes a trimodal fusion method integrating speaker-identity-aware transfer learning with the MAMBA architecture. First, we establish a systematic data quality validation pipeline. Second, we extract audio-visual speaker identity embeddings using RecoMadeEasy® and model textual sentiment representations via MPNet-v2; an emotion-specific MLP collaborates with MAMBA to capture dynamic cross-modal dependencies. Third, speaker and facial identity features are explicitly transferred to model individual differences in emotional expression, significantly enhancing discriminability for sparse emotion classes. Evaluated on MELD and IEMOCAP, our method achieves 64.8% and 74.3% accuracy, respectively—outperforming state-of-the-art approaches. Key contributions include: (1) a principled data quality assessment framework; (2) identity-aware multimodal representation learning with MAMBA-based dynamic fusion; and (3) improved generalization to infrequent emotion categories through speaker-identity transfer.

Technology Category

Application Category

📝 Abstract
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and Facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
Problem

Research questions and friction points this paper is trying to address.

Addressing data quality issues in multimodal emotion recognition conversations
Implementing quality control for speaker identity and multimodal alignment
Improving emotion recognition through identity-based transfer learning fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality control pipeline validates multimodal data integrity
Identity-based transfer learning extracts emotion-discriminative features
MAMBA fusion combines tuned unimodal representations for classification
🔎 Similar Papers
No similar papers found.