Speech Emotion Recognition With ASR Transcripts: a Comprehensive Study on Word Error Rate and Fusion Techniques

📅 2024-06-12

🏛️ Spoken Language Technology Workshop

📈 Citations: 8

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation that speech emotion recognition (SER) suffers when its text input comes from automatic speech recognition (ASR) transcripts containing word errors. We systematically evaluate how transcripts from 11 ASR models, spanning a range of word error rates (WERs), affect both text-only and audio-text bimodal SER on three benchmark corpora, comparing six fusion techniques in the bimodal setting. To improve robustness, we propose an ASR error-robust framework that combines ASR error correction with modality-gated fusion. The framework yields lower WER and higher SER results than the best-performing raw ASR transcript, for example on IEMOCAP. The study thereby examines empirically how feasible and effective SER systems are when they are built for realistic, noisy ASR output.

📝 Abstract

Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into SER with ASR assistance, especially for real-world applications.
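As a point of reference for the benchmark described above, WER is conventionally the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; real evaluations often also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("so" -> "sew") and one deletion ("today") over 5 words.
    print(word_error_rate("i am so happy today", "i am sew happy"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.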

Problem

Research questions and friction points this paper is trying to address.

Benchmarking SER performance using ASR transcripts with varying WERs

Investigating text-only and bimodal SER with six fusion techniques

Proposing an ASR error-robust framework for improved SER results

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks SER using ASR transcripts with varying WERs

Proposes ASR error-robust framework with error correction

Integrates modality-gated fusion for improved SER performance
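The page does not describe how the modality-gated fusion is built, so the following is only a generic sketch of the idea: a sigmoid gate, computed from both modality embeddings, decides per dimension how much to trust the audio features versus the (possibly error-laden) text features. The function `gated_fusion`, its parameters, and the convex-combination form are all assumptions for illustration and should not be read as the paper's architecture.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    audio: Sequence[float],
    text: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    gate_bias: Sequence[float],
) -> List[float]:
    """Fuse two same-sized embeddings with a per-dimension gate.

    Hypothetical form: g = sigmoid(W [audio; text] + b),
    fused = g * audio + (1 - g) * text.
    In a trained model W and b would be learned; here they are passed in.
    """
    if len(audio) != len(text):
        raise ValueError("audio and text embeddings must have the same size")
    concat = list(audio) + list(text)
    fused = []
    for k in range(len(audio)):
        z = sum(w * x for w, x in zip(gate_weights[k], concat)) + gate_bias[k]
        g = sigmoid(z)  # g near 1 favors audio, g near 0 favors text
        fused.append(g * audio[k] + (1.0 - g) * text[k])
    return fused


if __name__ == "__main__":
    audio_emb = [1.0, 2.0]
    text_emb = [3.0, 0.0]
    zero_w = [[0.0] * 4 for _ in range(2)]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_fusion(audio_emb, text_emb, zero_w, [0.0, 0.0]))
```

The intuition behind a gate of this kind is that when the transcript is unreliable, a learned gate can shift weight toward the audio representation instead of treating both modalities equally.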
