Speech Emotion Recognition With ASR Transcripts: a Comprehensive Study on Word Error Rate and Fusion Techniques

📅 2024-06-12
🏛️ Spoken Language Technology Workshop
📈 Citations: 8
✨ Influential: 0
🤖 AI Summary
This work addresses the performance degradation that speech emotion recognition (SER) suffers when it relies on automatic speech recognition (ASR) transcripts, whose quality is measured by word error rate (WER). We systematically evaluate the impact of 11 ASR models on both unimodal text-based and audio-text multimodal SER across three benchmark corpora, using six multimodal fusion strategies. To improve robustness, we propose an ASR-error-robust framework that combines textual error correction with a modality-gated fusion mechanism. Experimental results indicate that the framework achieves lower WER and higher SER performance than the best-performing raw ASR transcript, for example on IEMOCAP. The study provides empirical evidence on the feasibility of SER systems built for realistic, noisy ASR outputs.

📝 Abstract
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into SER with ASR assistance, especially for real-world applications.
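The benchmark above is organized around Word Error Rate, so a small illustration of the metric may help. The following is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words), assuming whitespace tokenization and no text normalization. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the paper, which may score transcripts with a different tool or normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are tokenized by plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of
    # ref and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (happy -> sappy) and one deletion (today)
# over 5 reference words.
print(word_error_rate("i am so happy today", "i am so sappy"))  # 0.4
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.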
Problem

Research questions and friction points this paper is trying to address.

Benchmarking SER performance using ASR transcripts with varying WERs
Investigating text-only and bimodal SER with six fusion techniques
Proposing an ASR error-robust framework for improved SER results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks SER using ASR transcripts with varying WERs
Proposes ASR error-robust framework with error correction
Integrates modality-gated fusion for improved SER performance
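To make the idea of modality-gated fusion concrete, here is a minimal sketch of one common generic formulation: a sigmoid gate computed from the concatenated audio and text features decides, per dimension, how much of each modality to keep. This is an illustrative assumption, not the architecture described in the paper. The function `gated_fusion`, its parameters `weights` and `bias`, and the toy vectors are all hypothetical, and in practice the gate parameters would be learned rather than set by hand.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, text, weights, bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    audio, text: lists of d floats (hypothetical modality embeddings).
    weights:     d rows of 2*d floats, applied to the concatenation [audio; text].
    bias:        list of d floats.
    Returns g * audio + (1 - g) * text, where g = sigmoid(W [audio; text] + b).
    """
    if len(audio) != len(text):
        raise ValueError("audio and text vectors must have the same length")
    concat = list(audio) + list(text)
    fused = []
    for k in range(len(audio)):
        z = sum(w * x for w, x in zip(weights[k], concat)) + bias[k]
        g = sigmoid(z)  # gate near 1 favors audio, near 0 favors text
        fused.append(g * audio[k] + (1.0 - g) * text[k])
    return fused


audio_vec = [1.0, 0.0]
text_vec = [0.0, 1.0]
zero_weights = [[0.0] * 4 for _ in range(2)]

# With zero weights and zero bias the gate is 0.5 everywhere,
# so the output is the plain average of the two modalities.
print(gated_fusion(audio_vec, text_vec, zero_weights, [0.0, 0.0]))  # [0.5, 0.5]
```

The intuition for using a gate with noisy transcripts is that the model can lean on audio features when the text is unreliable. Whether and how the paper's framework does this is not specified in the text above.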