🤖 AI Summary
Despite rapid advances in deep learning, the actual progress of speech emotion recognition (SER) remains unclear due to inconsistent evaluation protocols and unverified claims of improvement. Method: We systematically reproduce and uniformly evaluate 15 years (2009–2024) of representative models -- including CNNs, RNNs, and Transformers -- under identical end-to-end training and standardized benchmarking on mainstream datasets, covering both audio-based and text-based approaches. Contribution/Results: Our analysis reveals performance saturation following the adoption of Transformer architectures, with marginal gains diminishing significantly; moreover, perceived performance gaps among state-of-the-art models are inflated by current evaluation paradigms. Fundamental bottlenecks -- including dataset bias, label noise, and ineffective multimodal fusion -- persistently constrain progress. This work provides empirical evidence of an "illusion of progress" in SER, proposing a standardized evaluation framework and a bottleneck attribution methodology to guide rigorous, reproducible future research.
📝 Abstract
Speech emotion recognition (SER) has long benefited from the adoption of deep learning methodologies. Deeper models -- with more layers and more trainable parameters -- are generally perceived as being "better" by the SER community. This raises the question -- *how much better* are modern-era deep neural networks compared to their earlier iterations? Beyond that, the more important question of how to move forward remains as pressing as ever. SER is far from a solved problem; therefore, identifying the most prominent avenues of future research is of paramount importance. In the present contribution, we attempt a quantification of progress in the 15 years of research beginning with the introduction of the landmark 2009 INTERSPEECH Emotion Challenge. We conduct a large-scale investigation of model architectures, spanning both audio-based models that rely on speech inputs and text-based models that rely solely on transcriptions. Our results point towards diminishing returns and a plateau after the recent introduction of transformer architectures. Moreover, we demonstrate how perceptions of progress are conditioned on the particular selection of models that are compared. Our findings have important repercussions for the state of the art in SER research and the paths forward.