MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

πŸ“… 2025-06-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Speech emotion recognition (SER) in naturalistic conditions faces challenges including emotional subjectivity, class imbalance, and ambiguous decision boundaries. To address these, we propose MEDUSA, a multimodal framework with a four-stage training pipeline. The first two stages train an ensemble of classifiers built on DeepSER, a deep cross-modal Transformer fusion mechanism that combines pretrained self-supervised acoustic and linguistic representations; Manifold MixUp provides additional regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Training incorporates human annotation scores as soft targets, together with balanced data sampling and multitask learning. MEDUSA integrates soft-target learning, self-supervised representation learning, and geometric data augmentation, and ranked 1st in Task 1 of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, demonstrating state-of-the-art performance in realistic, unconstrained SER settings.
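The soft-label idea above can be sketched minimally: annotator votes for each utterance become a probability distribution, and the model is trained against that distribution rather than a single hard class. The helper names and the vote-count normalization are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def soft_targets(votes, n_classes):
    # Turn one utterance's annotator votes (class indices) into a soft
    # label distribution; plain vote-count normalization is an assumption.
    counts = np.bincount(votes, minlength=n_classes).astype(float)
    return counts / counts.sum()

def soft_cross_entropy(log_probs, target):
    # Cross-entropy of model log-probabilities against the soft target.
    return -float(np.dot(target, log_probs))
```

For example, three annotators voting `[0, 0, 1]` over three classes yield the target `[2/3, 1/3, 0]`, which penalizes the model less for predicting a plausible secondary emotion.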

πŸ“ Abstract
Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism, operating on pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
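The Manifold MixUp regularization mentioned in the abstract mixes intermediate hidden representations (not raw inputs) and their soft labels with a Beta-sampled coefficient. A minimal sketch follows; which fusion layer the mixing is applied to, and the `alpha` value, are assumptions here.

```python
import numpy as np

def manifold_mixup(h_a, h_b, y_a, y_b, alpha=0.2, rng=None):
    # Convexly combine two hidden representations and their (soft) labels
    # with lam ~ Beta(alpha, alpha), as in Manifold MixUp. In MEDUSA this
    # would mix intermediate fusion activations (layer choice assumed).
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b
```

Because the labels are already soft distributions, the mixed label remains a valid distribution, which composes naturally with the soft-target training described above.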
Problem

Research questions and friction points this paper is trying to address.

Addresses class imbalance in speech emotion recognition
Handles emotion ambiguity in naturalistic conditions
Improves multimodal fusion for emotion classification
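The class-imbalance point above is commonly addressed with balanced sampling. A minimal sketch of inverse-frequency example weights for a weighted sampler is below; MEDUSA's actual scheme couples balanced sampling with multitask learning, so treat this as an illustrative assumption.

```python
import numpy as np

def balanced_sample_weights(labels):
    # Assign each example a weight inversely proportional to its class
    # frequency, so a weighted sampler draws rare emotions as often as
    # common ones in expectation.
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    w = np.array([1.0 / freq[lab] for lab in labels])
    return w / w.sum()
```

With labels `[0, 0, 0, 1]`, the lone class-1 example receives as much total sampling mass as all three class-0 examples combined.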
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal deep fusion transformer for SER
Four-stage training with ensemble classifiers
Meta-classifier optimized with soft targets
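The meta-classifier combines the ensemble members' predictions. As a stand-in sketch, a softmax-weighted convex combination of per-model class probabilities is shown below; the paper's meta-classifier is a trainable module, so reducing it to a weighted average is an assumption for illustration.

```python
import numpy as np

def meta_combine(model_probs, weights):
    # Softmax-normalize the (trainable) weights, then take a weighted
    # average over ensemble members: (n_models,) @ (n_models, n_classes).
    w = np.exp(weights - np.max(weights))
    w /= w.sum()
    return w @ np.asarray(model_probs)
```

With two members predicting opposite classes and equal weights, the combination returns the uniform distribution, and gradient updates to `weights` would shift mass toward the more reliable member.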
πŸ”Ž Similar Papers
No similar papers found.
Georgios Chatzichristodoulou
ECE, National Technical University of Athens, Greece
Despoina Kosmopoulou
ECE, National Technical University of Athens, Greece; Archimedes, Athena RC, Greece
Antonios Kritikos
ECE, National Technical University of Athens, Greece
Anastasia Poulopoulou
ECE, National Technical University of Athens, Greece
Efthymios Georgiou
Postdoc, University of Bern | Ex. NTUA, AthenaRC
Athanasios Katsamanis
CTO and co-founder, Auxilis AI; Principal Researcher, ILSP, Athena Research Center
conversational AI, behavioral informatics, speech processing, multimodal signal processing
Vassilis Katsouros
Athena Research Center - Institute for Language and Speech Processing
Alexandros Potamianos
National Technical University of Athens
speech processing, natural language processing, signal processing, dialogue