🤖 AI Summary
This study addresses two gaps: the lack of a unified evaluation benchmark for speech large language models (LLMs) in emotion recognition, and conventional approaches' neglect of emotion's inherent subjectivity and ambiguity. To this end, the authors construct VoxEmo, a large-scale benchmark spanning 15 languages and 35 corpora, and introduce a distribution-aware soft-labeling protocol alongside multi-level prompting templates that enable evaluations ranging from categorical classification to paralinguistic reasoning. Combining generative interfaces, prompts of varying complexity, and prompt-ensembling strategies, their experiments demonstrate that while zero-shot speech LLMs underperform supervised models in hard-label accuracy, their predicted emotion distributions align more closely with human subjective judgments, effectively capturing the uncertainty intrinsic to emotional expression.
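The prompt-ensembling idea above can be sketched in a few lines: the same clip is queried under several prompt phrasings, each per-prompt prediction is treated like one "annotator", and the votes are pooled into an emotion distribution. This is a minimal illustration under assumed names; the prompt texts, label set, and the `mock_speech_llm` stub are invented for the example and are not VoxEmo's actual templates or interface.

```python
from collections import Counter

# Illustrative prompt variants (assumption, not VoxEmo's real templates).
PROMPTS = [
    "What emotion does the speaker convey?",
    "Classify the emotion in this audio clip.",
    "Based on tone and prosody, how does the speaker feel?",
]

LABELS = ["angry", "happy", "neutral", "sad"]

def mock_speech_llm(clip, prompt):
    """Stand-in for a speech-LLM call; zero-shot outputs vary by prompt."""
    canned = {PROMPTS[0]: "happy", PROMPTS[1]: "happy", PROMPTS[2]: "neutral"}
    return canned[prompt]

def prompt_ensemble_distribution(clip, prompts, labels):
    """Pool one prediction per prompt into a soft emotion distribution."""
    votes = Counter(mock_speech_llm(clip, p) for p in prompts)
    return {lab: votes.get(lab, 0) / len(prompts) for lab in labels}

dist = prompt_ensemble_distribution("clip_001.wav", PROMPTS, LABELS)
print(dist)  # prompt disagreement yields a happy/neutral mixture
```

The key point is that prompt-to-prompt variation, normally a nuisance in zero-shot evaluation, is repurposed here as a proxy for annotator disagreement.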
📝 Abstract
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open-ended text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Moreover, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring prompts of varying complexity, from direct classification to paralinguistic reasoning. To better reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective emotion distributions.
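The distribution-aware soft-label protocol can be illustrated with a small sketch: annotator votes become a probability distribution over emotions, and a model's predicted distribution is scored against it rather than against a single majority label. The label set, the example votes, and the choice of Jensen-Shannon divergence as the distance are assumptions made for this illustration, not the paper's specified metric.

```python
from collections import Counter
import math

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts.get(e, 0) / len(votes) for e in EMOTIONS]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): symmetric, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Five annotators disagree: the clip is ambiguous between happy and neutral.
human = soft_label(["happy", "happy", "neutral", "happy", "neutral"])

# A model that hedges toward the same ambiguity scores better under the
# soft-label view than one that is confidently "happy"-only, even though
# both would be counted correct under majority-vote hard-label accuracy.
hedged = [0.0, 0.55, 0.40, 0.05]
confident = [0.0, 1.00, 0.00, 0.00]
print(js_divergence(human, hedged) < js_divergence(human, confident))  # True
```

This is exactly the regime where the abstract's finding applies: a zero-shot model can lose on hard-label accuracy yet produce distributions closer to how humans actually perceive the clip.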