Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models

📅 2023-12-11
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the fragmented evaluation of Speech Emotion Recognition (SER) models by proposing a multi-dimensional testing framework that jointly assesses correctness, fairness, and robustness. Methodologically, it introduces a task-configurable testing paradigm, designs a data-driven mechanism for automatically calibrating fairness test thresholds, and identifies a prevalent "textual sentiment shortcut" in SER models, where models rely on textual cues rather than acoustic features. Experiments evaluate an xLSTM-based model and nine transformer-based acoustic foundation models against a convolutional baseline, across four SER tasks (arousal, valence, dominance, and categorical emotion) using multi-metric assessment. Results reveal substantial fairness disparities among models with similarly high recall or correlation, and show that high-scoring models may depend on the text sentiment shortcut. This work establishes a systematic, empirically grounded benchmark and toolkit for trustworthy SER model evaluation.
📝 Abstract
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate behaviour of speech emotion recognition models, by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. It also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. We evaluated an xLSTM-based and nine transformer-based acoustic foundation models against a convolutional baseline model, testing their performance on arousal, valence, dominance, and emotional category classification. The test results highlight that models with high correlation or recall might rely on shortcuts, such as text sentiment, and differ in terms of fairness.
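The abstract's idea of calibrating fairness thresholds from the datasets themselves can be sketched in a few lines. This is a hypothetical illustration, not the authors' actual toolkit: the function names, the per-group scores, and the choice of "k standard deviations of the reference spread" as the calibration rule are all assumptions for demonstration.

```python
# Hypothetical sketch of a data-driven fairness test for an SER model.
# All names, scores, and the k-sigma calibration rule are illustrative
# assumptions, not the paper's actual method or API.
from statistics import mean, pstdev

def fairness_threshold(reference_group_scores, k=2.0):
    """Calibrate the allowed per-group deviation from the spread of
    per-group scores observed on a reference dataset."""
    return k * pstdev(reference_group_scores)

def run_fairness_test(group_scores, threshold):
    """Pass if no group's score deviates from the mean score by more
    than the calibrated threshold."""
    mu = mean(group_scores)
    return all(abs(s - mu) <= threshold for s in group_scores)

# Per-group recall of a hypothetical valence classifier:
reference = [0.71, 0.69, 0.70, 0.72]   # scores used for calibration
candidate = [0.70, 0.55, 0.73, 0.71]   # model under test

t = fairness_threshold(reference)
print(run_fairness_test(candidate, t))  # False: one group lags badly
```

The point of deriving the threshold from data rather than fixing it by hand is that an acceptable disparity depends on how noisy the per-group scores are on the datasets at hand, which is what the paper's automatic calibration addresses.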
Problem

Research questions and friction points this paper is trying to address.

Evaluates speech emotion recognition model accuracy
Ensures fairness in emotion recognition models
Tests robustness of emotion recognition systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing framework for SER models
Automatic fairness test thresholds
Evaluation of xLSTM and transformers
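The testing framework above groups metrics into correctness, fairness, and robustness and requires each metric to reach a threshold for the model to pass. A minimal sketch of such a task-configurable suite, assuming illustrative metric names and thresholds (none taken from the paper):

```python
# Hypothetical sketch of a configurable multi-metric test suite; the
# metric names and thresholds below are illustrative assumptions.
TESTS = {
    "correctness": {"concordance_cc": 0.50, "recall": 0.55},
    "robustness":  {"recall_added_noise": 0.50},
}

def run_suite(scores, tests=TESTS):
    """Per test group, a model passes only if every metric in the
    group reaches its threshold."""
    return {
        group: all(scores[metric] >= threshold
                   for metric, threshold in metrics.items())
        for group, metrics in tests.items()
    }

scores = {"concordance_cc": 0.62, "recall": 0.58, "recall_added_noise": 0.41}
print(run_suite(scores))  # {'correctness': True, 'robustness': False}
```

Because the suite is just a mapping from tasks to metric thresholds, swapping in a different SER task (e.g. valence regression instead of categorical classification) only changes the configuration, not the test runner.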
Anna Derington
audEERING GmbH, Gilching, Germany
H. Wierstorf
audEERING GmbH, Gilching, Germany
Ali Ozkil
Jabra, GN Audio, Copenhagen, Denmark
F. Eyben
audEERING GmbH, Gilching, Germany
Felix Burkhardt
audEERING
Speech and language processing
Björn W. Schuller
audEERING GmbH, Gilching, Germany; CHI – Chair of Health Informatics, Technical University of Munich, Germany; GLAM – Group on Language, Audio, & Music, Imperial College London, UK