AI Summary
Speech emotion recognition (SER) in naturalistic scenarios faces challenges including emotional subjectivity, class imbalance, and ambiguous decision boundaries. To address these, we propose MEDUSA, a multimodal framework with a four-stage training pipeline. The first two stages train an ensemble of classifiers built on DeepSER, a novel deep cross-modal Transformer fusion mechanism that aligns pretrained self-supervised acoustic and linguistic representations; Manifold MixUp provides additional regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Throughout training, human-annotated emotion scores are incorporated as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA thus integrates soft-target knowledge distillation, self-supervised representation learning, and manifold-based data augmentation. It ranked 1st in Task 1 of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, demonstrating state-of-the-art performance in realistic, unconstrained SER settings.
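To make the regularization step concrete, below is a minimal PyTorch sketch of Manifold MixUp combined with soft annotator targets. The mixing layer, Beta parameter `alpha`, mean pooling, and soft cross-entropy loss are illustrative assumptions for this example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def manifold_mixup_step(encoder_layers, classifier, feats, soft_targets, alpha=1.0):
    """One training step with Manifold MixUp on soft emotion targets.

    feats:        (B, T, D) fused acoustic + linguistic features
    soft_targets: (B, C) annotator score distributions (rows sum to 1)
    encoder_layers / classifier / alpha are hypothetical stand-ins;
    the paper's actual layer choice and loss weighting may differ.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    k = torch.randint(len(encoder_layers) + 1, (1,)).item()  # random mixing depth
    perm = torch.randperm(feats.size(0))                     # pairing permutation

    h = feats
    for i, layer in enumerate(encoder_layers):
        if i == k:  # interpolate hidden states ("manifold" mixup)
            h = lam * h + (1 - lam) * h[perm]
        h = layer(h)
    if k == len(encoder_layers):  # mix at the final hidden state instead
        h = lam * h + (1 - lam) * h[perm]

    logits = classifier(h.mean(dim=1))  # pool over time, then classify
    targets = lam * soft_targets + (1 - lam) * soft_targets[perm]
    # soft-target cross-entropy (KL divergence up to a constant)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

Mixing hidden representations rather than raw inputs smooths the decision boundaries between ambiguous emotion classes, which is exactly where naturalistic SER labels disagree most.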
Abstract
Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline that effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism, applied to pretrained self-supervised acoustic and linguistic representations; Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
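As an illustration of the last two stages, the sketch below fits a small meta-classifier on the concatenated class probabilities of frozen base models, trained against the same soft annotator targets. The MLP architecture, hidden size, and dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaClassifier(nn.Module):
    """Trainable stacking head over an ensemble of SER classifiers.

    A minimal sketch: each frozen base model emits class probabilities,
    which are concatenated and re-weighted by a small MLP. Depth and
    hidden size here are illustrative, not taken from the paper.
    """
    def __init__(self, n_models: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_models * n_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, ensemble_probs: torch.Tensor) -> torch.Tensor:
        # ensemble_probs: (B, n_models, n_classes) from frozen base models
        return self.net(ensemble_probs.flatten(start_dim=1))

# Training against soft annotator targets (dummy tensors for shape only):
meta = MetaClassifier(n_models=5, n_classes=8)
probs = torch.rand(4, 5, 8).softmax(dim=-1)      # stand-in ensemble outputs
soft_targets = torch.rand(4, 8).softmax(dim=-1)  # stand-in annotator scores
loss = F.kl_div(F.log_softmax(meta(probs), dim=-1), soft_targets,
                reduction="batchmean")
```

A learned combiner of this kind can down-weight base models that are unreliable for particular emotion classes, which a fixed averaging rule cannot do.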