AI Summary
This work addresses a key limitation of traditional speech emotion recognition, which typically relies on single-label annotations and overlooks the inherent ambiguity of affective states. Existing approaches to ambiguous emotion recognition model this uncertainty as probability distributions, but they are constrained by sparse and unreliable distributions inferred from human annotations. To overcome this bottleneck, the study proposes a framework that leverages Large Audio-Language Models (ALMs) to generate high-quality synthetic annotations, constructing "Synthetic Perceptual Proxies" that augment limited human labels. It further introduces DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy, to address class imbalance and enable unbiased evaluation. Experiments on IEMOCAP and MSP-Podcast show that the proposed method improves the modeling of emotion distributions, particularly in low-ambiguity regions where annotators agree, supporting synthetic annotation as a remedy for data scarcity and annotation bias, although the gains diminish for highly ambiguous emotions.
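The central idea, pooling sparse human votes with ALM-generated votes to obtain a denser, smoother target distribution, can be illustrated with a short sketch. The snippet below is a hypothetical illustration, not the paper's implementation: the four-way label set, the `augment_distribution` helper, and the convex-combination mixing rule controlled by `weight` are all assumptions made for clarity.

```python
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def to_distribution(labels, emotions=EMOTIONS):
    """Turn a list of categorical votes into a normalized probability vector."""
    counts = Counter(labels)
    total = sum(counts[e] for e in emotions) or 1
    return [counts[e] / total for e in emotions]

def augment_distribution(human_labels, synthetic_labels, weight=0.5):
    """Mix sparse human votes with ALM-generated votes.

    `weight` sets the contribution of the synthetic proxy; the paper's
    actual mixing rule is not specified here, so a convex combination
    is assumed for illustration.
    """
    p_human = to_distribution(human_labels)
    p_synth = to_distribution(synthetic_labels)
    return [(1 - weight) * h + weight * s for h, s in zip(p_human, p_synth)]

# Three human annotators vs. ten synthetic ALM annotations for one utterance
human = ["happy", "neutral", "happy"]
synthetic = ["happy"] * 6 + ["neutral"] * 3 + ["sad"]
print(augment_distribution(human, synthetic))  # denser, smoother target
```

With only three human votes, each disagreement shifts the empirical distribution by a third; adding synthetic votes reduces this quantization, which is the reliability argument the summary describes.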
Abstract
Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate this annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework that leverages ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve the reliability of ground-truth distributions. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs on the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution modeling, especially in low-ambiguity regions where annotation agreement is high; the benefits diminish, however, for highly ambiguous emotions that draw greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
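The abstract's statistical alignment analysis can be pictured with one plausible recipe: compare each synthetic proxy against the human distribution with a divergence measure, and use the entropy of the human distribution as an ambiguity proxy (low entropy means high annotator agreement). The paper does not spell out its metrics in this abstract, so the Jensen-Shannon distance and Shannon entropy used below are assumed, illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def alignment_report(p_human, p_synth):
    """Quantify how well a synthetic proxy matches the human distribution.

    Jensen-Shannon distance measures (dis)agreement between the two
    distributions; the entropy of the human distribution serves as a
    simple ambiguity proxy. Both metric choices are assumptions, not
    necessarily the paper's exact protocol.
    """
    p_human, p_synth = np.asarray(p_human), np.asarray(p_synth)
    return {
        "js_distance": float(jensenshannon(p_human, p_synth, base=2)),
        "human_entropy": float(entropy(p_human, base=2)),  # ambiguity proxy
    }

# Low-ambiguity utterance: annotators mostly agree, proxy tracks them closely
print(alignment_report([0.8, 0.1, 0.1, 0.0], [0.7, 0.2, 0.1, 0.0]))
# High-ambiguity utterance: flat human distribution, proxy diverges more
print(alignment_report([0.3, 0.3, 0.2, 0.2], [0.6, 0.1, 0.2, 0.1]))
```

Binning utterances by the entropy term and averaging the distance within each bin would reproduce the kind of low-ambiguity versus high-ambiguity comparison the experiments report.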