Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech emotion recognition (SER) faces the dual challenges of high emotional complexity and scarce labeled data. To address these, we propose a multi-loss collaborative learning framework that integrates SNR-driven, energy-adaptive mixup data augmentation with a frame-level attention mechanism, significantly enhancing the model's capacity to capture subtle emotional variations and improving feature discriminability. The framework jointly optimizes KL divergence, focal loss, center loss, and supervised contrastive loss to mitigate class imbalance and strengthen inter-class separation. Extensive experiments on four benchmark datasets (IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE) demonstrate state-of-the-art performance on all of them, validating the method's robustness and generalization capability.
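The summary mentions a frame-level attention mechanism but does not give its formulation. As a minimal sketch of one common variant (learned scoring of each frame followed by softmax-weighted pooling; the parameter shapes and `tanh` scoring are assumptions, not the paper's exact FLAM design):

```python
import numpy as np

def frame_level_attention(frames, w, v):
    """Score each frame, softmax-normalize the scores, and pool frames
    into a single utterance-level vector.

    frames: (T, D) frame-level features; w: (D, D) projection; v: (D,) scorer.
    """
    scores = np.tanh(frames @ w) @ v                  # (T,) unnormalized scores
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over frames
    return weights @ frames                           # (D,) weighted pooling

# usage: 100 frames of hypothetical 64-dim features
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 64))
w = rng.standard_normal((64, 64)) * 0.1
v = rng.standard_normal(64)
utt = frame_level_attention(frames, w, v)
print(utt.shape)  # (64,)
```

The softmax weights let emotionally salient frames dominate the pooled representation instead of averaging all frames uniformly.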

📝 Abstract
Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive losses to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate that our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
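The abstract characterizes EAM as SNR-based augmentation without giving the mixing rule. As a rough illustration only (the paper's exact formulation may differ), one way to mix two waveforms at a target SNR is to rescale the second signal's energy relative to the first; the `snr_db` value and 16 kHz example length are assumptions:

```python
import numpy as np

def energy_adaptive_mixup(x1, x2, snr_db):
    """Mix waveform x2 into x1 so that x2's energy sits snr_db decibels
    below x1's. A sketch of SNR-driven augmentation, not the paper's EAM."""
    p1 = np.mean(x1 ** 2)                               # primary signal power
    p2 = np.mean(x2 ** 2)                               # secondary signal power
    scale = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))    # gain for target SNR
    return x1 + scale * x2

rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)  # 1 s at a hypothetical 16 kHz rate
x2 = rng.standard_normal(16000)
mixed = energy_adaptive_mixup(x1, x2, snr_db=10.0)
print(mixed.shape)  # (16000,)
```

Sampling `snr_db` from a range would yield augmented samples at varied energy ratios, which is one plausible reading of "energy-adaptive".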
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech emotion recognition with multi-loss learning
Addressing emotional complexity and data scarcity in SER
Improving feature extraction and class separability in SER
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-adaptive mixup for SNR-based data augmentation
Frame-level attention module for multi-frame feature extraction
Multi-loss learning combining four loss functions for optimization
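The four losses named above are typically combined as a weighted sum. The sketch below uses textbook forms of each term with hypothetical weights `lambdas`, focal parameter `gamma`, and contrastive temperature `tau` (none of these values appear in this summary):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multi_loss(logits, teacher_probs, labels, feats, centers,
               lambdas=(1.0, 1.0, 0.1, 0.1), gamma=2.0, tau=0.1):
    """Weighted sum of KL, focal, center, and supervised contrastive losses.
    A sketch with assumed weights; the paper's weighting scheme is not given here."""
    n = len(labels)
    p = softmax(logits)
    # KL divergence from a teacher/soft target distribution to predictions
    kl = np.mean(np.sum(teacher_probs *
                        (np.log(teacher_probs + 1e-12) - np.log(p + 1e-12)), axis=1))
    # focal loss: down-weight easy examples via (1 - p_t)^gamma
    pt = p[np.arange(n), labels]
    focal = np.mean(-((1 - pt) ** gamma) * np.log(pt + 1e-12))
    # center loss: pull features toward their class centers
    center = 0.5 * np.mean(np.sum((feats - centers[labels]) ** 2, axis=1))
    # supervised contrastive loss on L2-normalized features
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = z @ z.T / tau
    self_mask = np.eye(n, dtype=bool)
    sim = sim - sim.max(axis=1, keepdims=True)          # stability shift
    exp_sim = np.exp(sim) * ~self_mask
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True) + 1e-12)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    supcon = np.mean(-(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1))
    l_kl, l_f, l_c, l_sc = lambdas
    return l_kl * kl + l_f * focal + l_c * center + l_sc * supcon

# usage with a toy 4-sample, 2-class batch
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
logits = rng.standard_normal((4, 2))
teacher = softmax(rng.standard_normal((4, 2)))
feats = rng.standard_normal((4, 8))
centers = rng.standard_normal((2, 8))
loss = multi_loss(logits, teacher, labels, feats, centers)
print(float(loss))
```

Each term targets a different failure mode: KL transfers soft label structure, focal loss counters class imbalance, center loss tightens intra-class clusters, and the contrastive term widens inter-class margins.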
👥 Authors
Cong Wang (Beijing University of Posts and Telecommunications, Beijing, China)
Yizhong Geng (Beijing University of Posts and Telecommunications)
Yuhua Wen (Beijing University of Posts and Telecommunications, Beijing, China)
Qifei Li (Beijing University of Posts and Telecommunications, Beijing, China)
Yingming Gao (Beijing University of Posts and Telecommunications)
Ruimin Wang (Li Auto)
Chunfeng Wang (ByteDance Inc.)
Hao Li (Li Auto)
Ya Li (Beijing University of Posts and Telecommunications, Beijing, China)
Wei Chen (Li Auto)