AI Summary
Speech emotion recognition (SER) in naturalistic scenarios faces challenges including emotional subjectivity, class imbalance, and ambiguous decision boundaries. To address these, we propose MEDUSA, a multimodal framework with a four-stage training pipeline. The first two stages train an ensemble of classifiers built on DeepSER, a novel deep cross-modal Transformer fusion mechanism that aligns pretrained self-supervised acoustic and linguistic representations; Manifold MixUp provides additional regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Throughout training, human-annotated emotion scores are incorporated as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA thus integrates soft-target knowledge distillation, self-supervised representation learning, and manifold-based data augmentation. It ranked 1st in Task 1 of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, demonstrating state-of-the-art performance in realistic, unconstrained SER settings.
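To make the regularization step concrete, below is a minimal PyTorch sketch of Manifold MixUp combined with soft annotator targets. The mixing layer, Beta parameter `alpha`, mean pooling, and soft cross-entropy loss are illustrative assumptions for this example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def manifold_mixup_step(encoder_layers, classifier, feats, soft_targets, alpha=1.0):
    """One training step with Manifold MixUp on soft emotion targets.

    feats:        (B, T, D) fused acoustic + linguistic features
    soft_targets: (B, C) annotator score distributions (rows sum to 1)
    encoder_layers / classifier / alpha are hypothetical stand-ins;
    the paper's actual layer choice and loss weighting may differ.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    k = torch.randint(len(encoder_layers) + 1, (1,)).item()  # random mixing depth
    perm = torch.randperm(feats.size(0))                     # pairing permutation

    h = feats
    for i, layer in enumerate(encoder_layers):
        if i == k:  # interpolate hidden states ("manifold" mixup)
            h = lam * h + (1 - lam) * h[perm]
        h = layer(h)
    if k == len(encoder_layers):  # mix at the final hidden state instead
        h = lam * h + (1 - lam) * h[perm]

    logits = classifier(h.mean(dim=1))  # pool over time, then classify
    targets = lam * soft_targets + (1 - lam) * soft_targets[perm]
    # soft-target cross-entropy (KL divergence up to a constant)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

Mixing hidden representations rather than raw inputs smooths the decision boundaries between ambiguous emotion classes, which is exactly where naturalistic SER labels disagree most.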
Abstract
Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline that effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism, applied to pretrained self-supervised acoustic and linguistic representations; Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
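As an illustration of the last two stages, the sketch below fits a small meta-classifier on the concatenated class probabilities of frozen base models, trained against the same soft annotator targets. The MLP architecture, hidden size, and dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaClassifier(nn.Module):
    """Trainable stacking head over an ensemble of SER classifiers.

    A minimal sketch: each frozen base model emits class probabilities,
    which are concatenated and re-weighted by a small MLP. Depth and
    hidden size here are illustrative, not taken from the paper.
    """
    def __init__(self, n_models: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_models * n_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, ensemble_probs: torch.Tensor) -> torch.Tensor:
        # ensemble_probs: (B, n_models, n_classes) from frozen base models
        return self.net(ensemble_probs.flatten(start_dim=1))

# Training against soft annotator targets (dummy tensors for shape only):
meta = MetaClassifier(n_models=5, n_classes=8)
probs = torch.rand(4, 5, 8).softmax(dim=-1)      # stand-in ensemble outputs
soft_targets = torch.rand(4, 8).softmax(dim=-1)  # stand-in annotator scores
loss = F.kl_div(F.log_softmax(meta(probs), dim=-1), soft_targets,
                reduction="batchmean")
```

A learned combiner of this kind can down-weight base models that are unreliable for particular emotion classes, which a fixed averaging rule cannot do.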