AI Summary
This work addresses a key limitation of traditional speech emotion recognition, which typically relies on single-label annotations and overlooks the inherent ambiguity of affective states. Existing approaches to ambiguous emotion recognition model this uncertainty as probability distributions, but they are constrained by sparse and unreliable distributions inferred from human annotations. To overcome this bottleneck, the study proposes a framework that leverages Large Audio-Language Models (ALMs) to generate high-quality synthetic annotations, constructing "Synthetic Perceptual Proxies" that augment limited human labels. It further introduces DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy, to address class imbalance and enable unbiased evaluation. Experiments on IEMOCAP and MSP-Podcast show that the proposed method improves the modeling of emotion distributions, particularly in low-ambiguity regions where annotators agree, supporting synthetic annotation as a remedy for data scarcity and annotation bias, although the gains diminish for highly ambiguous emotions.
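The central idea, pooling sparse human votes with ALM-generated votes to obtain a denser, smoother target distribution, can be illustrated with a short sketch. The snippet below is a hypothetical illustration, not the paper's implementation: the four-way label set, the `augment_distribution` helper, and the convex-combination mixing rule controlled by `weight` are all assumptions made for clarity.

```python
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def to_distribution(labels, emotions=EMOTIONS):
    """Turn a list of categorical votes into a normalized probability vector."""
    counts = Counter(labels)
    total = sum(counts[e] for e in emotions) or 1
    return [counts[e] / total for e in emotions]

def augment_distribution(human_labels, synthetic_labels, weight=0.5):
    """Mix sparse human votes with ALM-generated votes.

    `weight` sets the contribution of the synthetic proxy; the paper's
    actual mixing rule is not specified here, so a convex combination
    is assumed for illustration.
    """
    p_human = to_distribution(human_labels)
    p_synth = to_distribution(synthetic_labels)
    return [(1 - weight) * h + weight * s for h, s in zip(p_human, p_synth)]

# Three human annotators vs. ten synthetic ALM annotations for one utterance
human = ["happy", "neutral", "happy"]
synthetic = ["happy"] * 6 + ["neutral"] * 3 + ["sad"]
print(augment_distribution(human, synthetic))  # denser, smoother target
```

With only three human votes, each disagreement shifts the empirical distribution by a third; adding synthetic votes reduces this quantization, which is the reliability argument the summary describes.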
Abstract
Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate this annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework that leverages ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve the reliability of ground-truth distributions. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs on the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution modeling, especially in low-ambiguity regions where annotation agreement is high; the benefits diminish, however, for highly ambiguous emotions that draw greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
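The abstract's statistical alignment analysis can be pictured with one plausible recipe: compare each synthetic proxy against the human distribution with a divergence measure, and use the entropy of the human distribution as an ambiguity proxy (low entropy means high annotator agreement). The paper does not spell out its metrics in this abstract, so the Jensen-Shannon distance and Shannon entropy used below are assumed, illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def alignment_report(p_human, p_synth):
    """Quantify how well a synthetic proxy matches the human distribution.

    Jensen-Shannon distance measures (dis)agreement between the two
    distributions; the entropy of the human distribution serves as a
    simple ambiguity proxy. Both metric choices are assumptions, not
    necessarily the paper's exact protocol.
    """
    p_human, p_synth = np.asarray(p_human), np.asarray(p_synth)
    return {
        "js_distance": float(jensenshannon(p_human, p_synth, base=2)),
        "human_entropy": float(entropy(p_human, base=2)),  # ambiguity proxy
    }

# Low-ambiguity utterance: annotators mostly agree, proxy tracks them closely
print(alignment_report([0.8, 0.1, 0.1, 0.0], [0.7, 0.2, 0.1, 0.0]))
# High-ambiguity utterance: flat human distribution, proxy diverges more
print(alignment_report([0.3, 0.3, 0.2, 0.2], [0.6, 0.1, 0.2, 0.1]))
```

Binning utterances by the entropy term and averaging the distance within each bin would reproduce the kind of low-ambiguity versus high-ambiguity comparison the experiments report.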