Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion

📅 2025-08-25
🤖 AI Summary
To address inefficient multi-task feature fusion, class imbalance, and semantic confusion in speech emotion recognition (SER) under naturalistic conditions, this paper proposes a multi-task collaborative learning framework. A collaborative attention (co-attention) module performs context-aware, dynamic feature fusion between the primary emotion classification task and three auxiliary tasks: gender identification, speaker verification, and automatic speech recognition (ASR). A Sample Weighted Focal Contrastive (SWFC) loss jointly mitigates class imbalance and inter-class semantic confusion. The framework fine-tunes self-supervised pretrained models (e.g., wav2vec 2.0) end to end under a multi-task objective. Evaluated on the SER-Naturalistic challenge, the approach reports significant improvements in emotion classification accuracy and generalization robustness, which supports the effectiveness of collaborative task modeling and a customized loss design.
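The fusion idea in the summary can be illustrated with a minimal sketch. It is not the paper's module, whose internals are not given here. It assumes a simplified, one-directional scaled dot-product attention in which a single emotion feature vector acts as the query and one vector per auxiliary task acts as key and value. The function name `co_attention_fuse`, the residual addition, and the toy feature values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def co_attention_fuse(emotion_feat, aux_feats):
    """Fuse an emotion vector with auxiliary task vectors (hypothetical helper).

    emotion_feat: list of floats, used as the attention query.
    aux_feats: dict mapping task name to a list of floats of the same
               length, used as both keys and values.
    Returns (fused_vector, attention_weight_per_task).
    """
    dim = len(emotion_feat)
    if any(len(v) != dim for v in aux_feats.values()):
        raise ValueError("all feature vectors must share one dimension")
    names = list(aux_feats)
    # Scaled dot-product scores between the emotion query and each task key.
    scores = [dot(emotion_feat, aux_feats[n]) / math.sqrt(dim) for n in names]
    weights = softmax(scores)
    # Attention-weighted sum of the auxiliary vectors (the context vector).
    context = [
        sum(w * aux_feats[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    # Residual fusion: an assumption of this sketch, not taken from the paper.
    fused = [e + c for e, c in zip(emotion_feat, context)]
    return fused, dict(zip(names, weights))


# Toy features, chosen only for illustration.
fused, weights = co_attention_fuse(
    [1.0, 0.0],
    {"gender": [1.0, 0.0], "speaker": [0.0, 1.0], "asr": [1.0, 0.0]},
)
```

Because the weights are recomputed per input, auxiliary tasks whose features align with the emotion feature contribute more to the fused vector, which is one way to read "dynamic" fusion.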

📝 Abstract
This study investigates fine-tuning self-supervised learning (SSL) models using multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework simultaneously handles four related tasks: emotion recognition, gender recognition, speaker verification, and automatic speech recognition. An innovative co-attention module is introduced to dynamically capture the interactions between features from the primary emotion classification task and auxiliary tasks, enabling context-aware fusion. Moreover, we introduce the Sample Weighted Focal Contrastive (SWFC) loss function to address class imbalance and semantic confusion by adjusting sample weights for difficult and minority samples. The method is validated on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge, showing significant performance improvements.
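A multi-task objective over the four tasks named in the abstract is commonly written as a weighted sum of per-task losses. The sketch below assumes that form. The abstract does not state how the task losses are combined, so the function name `multitask_loss` and the weight values are hypothetical.

```python
def multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (hypothetical helper).

    task_losses: dict mapping task name to its scalar loss for one batch.
    task_weights: dict mapping task name to a non-negative weight.
    """
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[t] * loss for t, loss in task_losses.items())


# Illustrative numbers only: the primary task gets weight 1.0 and each
# auxiliary task a smaller weight, so emotion recognition dominates.
losses = {"emotion": 1.0, "gender": 0.5, "speaker": 2.0, "asr": 4.0}
weights = {"emotion": 1.0, "gender": 0.1, "speaker": 0.1, "asr": 0.1}
total = multitask_loss(losses, weights)
```

Keeping the auxiliary weights small is a typical design choice when the auxiliary tasks exist to shape the shared representation rather than to be solved well in their own right.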
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech emotion recognition using multi-task learning
Addressing class imbalance with weighted loss functions
Dynamically fusing features from multiple related tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning SSL models with multi-task learning
Co-attention module for dynamic feature fusion
SWFC loss function addressing class imbalance
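The weighting idea behind the SWFC loss, giving more weight to difficult and minority samples, can be sketched on a simpler loss. The exact SWFC formulation is not given on this page, so this sketch swaps in a sample-weighted focal cross-entropy and omits the contrastive term entirely. The helper names, the inverse-frequency weighting scheme, and the default `gamma=2.0` are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weights n / (k * count[class]); rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def weighted_focal_loss(probs, labels, sample_weights=None, gamma=2.0):
    """Mean of w_i * (1 - p_i)**gamma * -log(p_i).

    probs: list of per-class probability rows, one row per sample.
    labels: list of integer class indices.
    p_i is the predicted probability of sample i's true class, so
    confidently correct samples are down-weighted by the focal factor.
    """
    if sample_weights is None:
        sample_weights = [1.0] * len(labels)
    total = 0.0
    for p_row, y, w in zip(probs, labels, sample_weights):
        p_true = max(p_row[y], 1e-12)  # guard against log(0)
        cross_entropy = -math.log(p_true)
        focal_factor = (1.0 - p_true) ** gamma
        total += w * focal_factor * cross_entropy
    return total / len(labels)


# Toy batch: class 1 is the minority and is also predicted poorly.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
loss = weighted_focal_loss(probs, labels, inverse_frequency_weights(labels))
```

With `gamma=0.0` and uniform weights the function reduces to plain cross-entropy, which makes the effect of the two adjustments easy to isolate.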
Honghong Wang
Beijing Fosafer Information Technology Co., Ltd., China
Jing Deng
Beijing Fosafer Information Technology Co., Ltd., China
Fanqin Meng
School of Automation and Information Engineering, Sichuan University of Science and Engineering
Optimization, target tracking, information fusion, machine learning, localization
Rong Zheng
Beijing Fosafer Information Technology Co., Ltd., China