Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses two key challenges in speech emotion recognition: label uncertainty arising from annotator disagreement, and poor generalization across speakers and acoustic conditions. To tackle these, we propose a probabilistic modeling paradigm that replaces scalar consensus labels with probability density functions (PDFs) of emotion ratings as supervision targets, coupled with saliency-weighted foundation model representations to improve robustness. Further contributions include (i) a multi-test-set evaluation framework with fine-grained stratification by speaker and gender, and (ii) Top-k hypothesis-based evaluation (specifically Top-2/Top-3 accuracy) to mitigate overreliance on overall accuracy. Experiments on benchmark datasets demonstrate state-of-the-art performance. Crucially, ablation reveals that conventional overall accuracy masks significant performance degradation across speakers, whereas Top-k metrics substantially improve result reliability and diagnostic validity.
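The Top-2/Top-3 evaluation advocated in the summary can be sketched as follows. This is a generic Top-k accuracy computation, not the paper's code; the class posteriors and labels below are toy numbers chosen for illustration.

```python
import numpy as np

def top_k_accuracy(posteriors, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    posteriors = np.asarray(posteriors)
    labels = np.asarray(labels)
    # Indices of the k largest scores per sample (argsort is ascending).
    topk = np.argsort(posteriors, axis=1)[:, -k:]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# Toy posteriors over four emotion classes (illustrative numbers only).
post = [
    [0.40, 0.35, 0.15, 0.10],  # true class 1 ranks 2nd
    [0.10, 0.20, 0.60, 0.10],  # true class 2 ranks 1st
    [0.26, 0.24, 0.30, 0.20],  # true class 0 ranks 2nd
]
labels = [1, 2, 0]

top1 = top_k_accuracy(post, labels, 1)  # only one of three argmax hits
top2 = top_k_accuracy(post, labels, 2)  # all true labels within the 2-best
```

With ambiguous utterances, a model's second-best hypothesis is often the "other" plausible emotion, which is why Top-2/Top-3 accuracy is a fairer measure under label uncertainty than Top-1 alone.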

📝 Abstract
Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce label uncertainty due to variation in grader opinion. Grader variation is commonly addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to consider ambiguous instances where a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observe that focusing on overall test-set performance can be deceiving, as it fails to reveal the models' generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets, together with performance analysis across gender and speakers, is useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
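The abstract's central idea, supervising with the distribution of grader opinions rather than the majority vote, can be sketched in a few lines. The emotion names, vote counts, and model posterior below are hypothetical; the paper does not specify this exact loss formulation.

```python
import numpy as np

# Hypothetical grader votes for one utterance over four emotion categories.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
votes = {"neutral": 1, "happy": 4, "angry": 1}  # 6 graders, split opinion

def soft_label(votes, emotions):
    """Turn raw grader votes into a probability-distribution target."""
    counts = np.array([votes.get(e, 0) for e in emotions], dtype=float)
    return counts / counts.sum()

def consensus_label(votes, emotions):
    """Conventional hard target: the emotion with the most votes."""
    counts = np.array([votes.get(e, 0) for e in emotions], dtype=float)
    return int(counts.argmax())

def cross_entropy(target, predicted, eps=1e-12):
    """Loss against the soft target; reduces to standard CE for a one-hot target."""
    return float(-(target * np.log(predicted + eps)).sum())

target = soft_label(votes, EMOTIONS)      # [1/6, 4/6, 0, 1/6]
hard = consensus_label(votes, EMOTIONS)   # index of "happy"
pred = np.array([0.2, 0.6, 0.1, 0.1])     # a model's posterior (made up)
loss = cross_entropy(target, pred)
```

The consensus target would treat the single-vote "neutral" and "angry" opinions as noise, while the soft target preserves them, so a model that assigns some mass to the minority emotions is no longer penalized for reflecting genuine ambiguity.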
Problem

Research questions and friction points this paper is trying to address.

Addressing label uncertainty in speech emotion grading
Improving emotion recognition across speakers and conditions
Evaluating model performance with label uncertainty and data-skew
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using probability density for emotion grade targets
Saliency driven foundation model representation selection
Evaluating models across multiple test-sets and demographics
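One common way to realize the representation-selection idea in the list above is to score each foundation-model layer and either softmax-weight all layers or keep only the most salient ones. The paper's exact saliency criterion is not given here, so the scores below are random stand-ins and the whole sketch is an assumption about the general shape of the technique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-layer FM outputs: (num_layers, frames, feature_dim).
num_layers, frames, dim = 12, 50, 768
layer_reps = rng.normal(size=(num_layers, frames, dim))

# Stand-in per-layer saliency scores (the real criterion would be
# task-driven, e.g. derived from probing or gradient attribution).
saliency = rng.random(num_layers)

# Variant 1: softmax-normalized weighted sum over layers.
weights = np.exp(saliency) / np.exp(saliency).sum()
combined = np.tensordot(weights, layer_reps, axes=1)  # (frames, dim)

# Variant 2: keep only the k most salient layers for the emotion model.
k = 4
top_layers = np.argsort(saliency)[-k:]
selected = layer_reps[top_layers]  # (k, frames, dim)
```

Either variant feeds a single (or reduced) representation to the downstream emotion model instead of committing to one hand-picked FM layer.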