M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition

📅 2025-09-23
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In multimodal speech emotion recognition (SER), automatic speech recognition (ASR) transcription errors degrade emotional discriminability. To address this, we propose M4SER—a unified framework that jointly models raw speech and ASR transcripts. M4SER introduces ASR error detection and correction as auxiliary tasks and integrates adversarial training with label-aware contrastive learning to achieve robust cross-modal emotional representation learning. Its core innovation lies in the synergistic integration of multi-representation modeling, multi-task learning, and multi-strategy optimization—enhancing modality-specific feature extraction while strengthening emotion discrimination. Extensive experiments on IEMOCAP and MELD demonstrate that M4SER significantly outperforms state-of-the-art methods in both accuracy and cross-context generalization, validating its effectiveness and robustness against ASR imperfections.
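The label-aware contrastive component described above is, in form, a supervised contrastive objective over fused emotion embeddings. The sketch below is a minimal PyTorch illustration under that assumption; the temperature, tensor shapes, and function name are placeholders rather than the paper's exact formulation.

```python
# Minimal sketch of a label-based (supervised) contrastive loss, assuming
# fused multimodal embeddings and integer emotion labels. Temperature and
# normalization choices are illustrative, not the authors' configuration.
import torch
import torch.nn.functional as F

def label_contrastive_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Pull embeddings with the same emotion label together, push others apart."""
    z = F.normalize(embeddings, dim=1)               # compare in cosine space
    sim = z @ z.t() / temperature                    # (batch, batch) similarities
    batch = labels.size(0)

    # Exclude self-pairs; positives are other samples sharing the same label.
    self_mask = torch.eye(batch, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all non-self pairs for each anchor.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability over each anchor's positives; anchors with no
    # positive in the batch contribute zero.
    pos_count = pos_mask.sum(dim=1)
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count.clamp(min=1)
    return (per_anchor * (pos_count > 0)).mean()
```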

📝 Abstract
Multimodal speech emotion recognition (SER) has emerged as pivotal for improving human-machine interaction. Researchers are increasingly leveraging both speech and textual information obtained through automatic speech recognition (ASR) to comprehensively recognize speakers' emotional states. Although this approach reduces reliance on human-annotated text data, ASR errors can degrade emotion recognition performance. To address this challenge, in our previous work, we introduced two auxiliary tasks, namely ASR error detection and ASR error correction, and we proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. Building on this foundation, in this paper we introduce two additional training strategies. First, we propose an adversarial network to enhance the diversity of modality-specific representations. Second, we introduce a label-based contrastive learning strategy to better capture emotional features. We refer to the proposed method as M4SER and validate its superiority over state-of-the-art methods through extensive experiments on the IEMOCAP and MELD datasets.
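The adversarial strategy mentioned in the abstract is often realized with a modality discriminator, sometimes paired with a gradient-reversal layer. The sketch below shows that generic building block; the class names, dimensions, and the precise adversarial direction (keeping modality-specific features separable versus making shared features invariant) are assumptions for illustration, not the paper's published architecture.

```python
# Generic adversarial building block: a gradient-reversal layer plus a
# modality discriminator. How it is wired into M4SER (which representations
# it sees, and whether gradients are reversed) is an assumption here.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts which modality (e.g., speech vs. ASR text) a representation came from."""
    def __init__(self, dim: int = 256, num_modalities: int = 2, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, num_modalities)
        )

    def forward(self, rep: torch.Tensor, reverse: bool = True) -> torch.Tensor:
        # reverse=True trains the upstream encoder to fool the discriminator
        # (adversarial); reverse=False trains it to keep modalities separable.
        if reverse:
            rep = GradReverse.apply(rep, self.lambd)
        return self.classifier(rep)
```

Either mode yields modality logits that are trained with a standard cross-entropy loss against the known modality of each representation.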
Problem

Research questions and friction points this paper is trying to address.

Addresses ASR errors degrading emotion recognition performance
Improves multimodal learning across speech and text modalities
Enhances emotion feature capture through novel training strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial network enhances modality-specific representation diversity
Label-based contrastive learning captures emotional features better
Multimodal fusion learns modality-specific and modality-invariant representations (see the sketch after this list)
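To make the fusion idea above concrete, here is a minimal sketch in which each modality passes through a private projection (modality-specific) and a weight-shared projection (modality-invariant) before concatenation and emotion classification. The layer sizes, activation choices, four-class output, and module name are illustrative assumptions, not the authors' exact design.

```python
# Illustrative multimodal fusion: private (modality-specific) and shared
# (modality-invariant) projections per modality, concatenated for emotion
# classification. All dimensions and names are assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, speech_dim: int, text_dim: int,
                 hidden: int = 256, num_emotions: int = 4):
        super().__init__()
        # Map each modality into a common hidden size first.
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Private projections keep modality-specific information.
        self.speech_private = nn.Linear(hidden, hidden)
        self.text_private = nn.Linear(hidden, hidden)
        # One shared projection (same weights for both modalities) encourages
        # a modality-invariant space.
        self.shared = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden * 4, num_emotions)

    def forward(self, speech_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        hs = torch.relu(self.speech_proj(speech_feat))
        ht = torch.relu(self.text_proj(text_feat))
        fused = torch.cat([
            torch.relu(self.speech_private(hs)),   # speech-specific
            torch.relu(self.text_private(ht)),     # text-specific
            torch.relu(self.shared(hs)),           # speech mapped to shared space
            torch.relu(self.shared(ht)),           # text mapped to shared space
        ], dim=-1)
        return self.classifier(fused)              # emotion logits
```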
Jiajun He
PhD Student, University of Cambridge
Probabilistic Methods · Machine Learning
Xiaohan Shi
Graduate School of Informatics, Nagoya University, Nagoya 464-8601, Japan
Cheng-Hung Hu
Academia Sinica
Jinyi Mi
Nagoya University
Xingfeng Li
Faculty of Data Science, City University of Macau, Macau 999078, China
Tomoki Toda
Nagoya University
Signal Processing · Speech Processing · Speech Synthesis