A Mixture-of-Experts model for multimodal emotion recognition in conversations

📅 2026-02-26
🏛️ Computer Speech & Language
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of temporal modeling and cross-modal fusion in multimodal conversational emotion recognition by proposing the MiSTER-E framework, which recognizes emotions without relying on speaker identity. The approach decouples modality-specific modeling from multimodal fusion, employing a Mixture-of-Experts (MoE) architecture to process speech and text separately, and dynamically integrates them via a learnable gating mechanism. Representation quality is further enhanced through context enrichment, cross-modal alignment, a supervised contrastive loss, and KL-divergence regularization. Utterance embeddings are extracted using large language models fine-tuned for speech and text, and temporal context is captured with a convolutional-recurrent network. The method achieves weighted F1 scores of 70.9%, 69.5%, and 87.9% on the IEMOCAP, MELD, and MOSI datasets, respectively, outperforming existing baselines.
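The expert-integration step in the summary can be made concrete with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration of the described idea: three experts (speech-only, text-only, cross-modal) each emit emotion logits, and a learned gate mixes them per utterance. All names, dimensions, and layer choices here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertGating(nn.Module):
    """Hypothetical sketch of the expert-integration step described above:
    speech-only, text-only, and cross-modal experts each emit emotion logits,
    and a learned gate mixes them per utterance. Dimensions, layer choices,
    and names are illustrative assumptions, not the paper's exact design."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.speech_expert = nn.Linear(dim, num_classes)     # speech-only head
        self.text_expert = nn.Linear(dim, num_classes)       # text-only head
        self.cross_expert = nn.Linear(2 * dim, num_classes)  # cross-modal head
        self.gate = nn.Linear(2 * dim, 3)                    # scores the 3 experts

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([speech_emb, text_emb], dim=-1)
        # Stack expert logits: (batch, 3, num_classes).
        logits = torch.stack([
            self.speech_expert(speech_emb),
            self.text_expert(text_emb),
            self.cross_expert(fused),
        ], dim=1)
        weights = F.softmax(self.gate(fused), dim=-1)        # (batch, 3)
        # Convex combination of the three experts' predictions.
        return (weights.unsqueeze(-1) * logits).sum(dim=1)
```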

📝 Abstract
Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores, respectively, outperforming several baseline speech-text ERC systems. We also provide ablations highlighting the contribution of each component of the proposed approach.
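The two auxiliary objectives the abstract mentions can be sketched as follows: a supervised contrastive loss over paired speech-text representations, and a KL-divergence regularizer across expert predictions. This is a minimal PyTorch rendering under assumed formulations; the function names, the mean-distribution KL target, and the temperature are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def cross_modal_supcon(speech_z, text_z, labels, temperature=0.07):
    """Supervised contrastive loss between paired speech-text embeddings:
    for each speech anchor, every text embedding with the same emotion
    label is a positive. Minimal sketch; the paper's exact form may differ."""
    speech_z = F.normalize(speech_z, dim=-1)
    text_z = F.normalize(text_z, dim=-1)
    sim = speech_z @ text_z.t() / temperature                   # (B, B) similarities
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # row-wise log-softmax
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # Mean log-likelihood of each anchor's positives, averaged over anchors.
    return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()


def expert_consistency_kl(expert_logits):
    """KL regularizer pulling each expert's class distribution toward the
    experts' mean distribution (an assumed instantiation of the paper's
    KL divergence across expert predictions).
    expert_logits: (batch, num_experts, num_classes)."""
    log_p = F.log_softmax(expert_logits, dim=-1)
    mean_p = log_p.exp().mean(dim=1, keepdim=True)              # (batch, 1, C)
    return F.kl_div(log_p, mean_p.expand_as(log_p), reduction="batchmean")
```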
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition in Conversations
Multimodal Fusion
Modality-specific Context Modeling
Speech-Text Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Multimodal Emotion Recognition
Large Language Models
Supervised Contrastive Learning
Dynamic Gating
Soumya Dutta
Smruthi Balaji
LEAP Lab, Department of Electrical Engineering, IISc Bangalore, Bangalore, 560012, Karnataka, India; Microsoft, Bangalore, Karnataka, India
Sriram Ganapathy
LEAP Lab, Department of Electrical Engineering, IISc Bangalore, Bangalore, 560012, Karnataka, India