🤖 AI Summary
To address poor data quality and low-frequency emotion class recognition in Multimodal Emotion Recognition in Conversations (MERC), this paper proposes a trimodal fusion method integrating speaker-identity-aware transfer learning with the MAMBA architecture. First, we establish a systematic data quality validation pipeline. Second, we extract audio-visual speaker identity embeddings using RecoMadeEasy® and model textual sentiment representations via MPNet-v2; an emotion-specific MLP collaborates with MAMBA to capture dynamic cross-modal dependencies. Third, speaker and facial identity features are explicitly transferred to model individual differences in emotional expression, significantly enhancing discriminability for sparse emotion classes. Evaluated on MELD and IEMOCAP, our method achieves 64.8% and 74.3% accuracy, respectively—outperforming state-of-the-art approaches. Key contributions include: (1) a principled data quality assessment framework; (2) identity-aware multimodal representation learning with MAMBA-based dynamic fusion; and (3) improved generalization to infrequent emotion categories through speaker-identity transfer.
📝 Abstract
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and Facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.