VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of a unified evaluation benchmark for speech large language models (LLMs) in emotion recognition, and the neglect of emotion's inherent subjectivity and ambiguity in conventional approaches. To this end, the authors construct VoxEmo, a large-scale benchmark spanning 15 languages and 35 corpora, and introduce a distribution-aware soft-labeling protocol alongside multi-level prompting templates to enable diverse evaluations, from categorical classification to paralinguistic reasoning. Using generative interfaces, prompts of varying complexity, and prompt-ensembling strategies, the experiments demonstrate that while zero-shot speech LLMs underperform supervised models in hard-label accuracy, their predicted emotion distributions align more closely with human subjective judgments, effectively capturing the uncertainty intrinsic to emotional expression.

📝 Abstract
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
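The core idea of the soft-label protocol can be made concrete: both annotator votes and prompt-ensemble predictions are turned into label distributions, which are then compared directly instead of via a single hard label. The sketch below is illustrative only; the emotion inventory, function names, and the choice of Jensen-Shannon divergence as the comparison metric are assumptions, not the paper's exact protocol.

```python
import math
from collections import Counter

# Illustrative emotion inventory (an assumption, not VoxEmo's label set).
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes):
    """Turn a list of labels (from human annotators, or from a prompt
    ensemble's predictions) into a distribution over the inventory."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {e: counts.get(e, 0) / total for e in EMOTIONS}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two label
    distributions: 0 for identical, up to 1 for disjoint support."""
    m = {e: 0.5 * (p[e] + q[e]) for e in EMOTIONS}
    def kl(a, b):
        return sum(a[e] * math.log2(a[e] / b[e])
                   for e in EMOTIONS if a[e] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Annotators disagree on an utterance; the ensemble's spread of answers
# is scored against that disagreement rather than a single majority label.
human = soft_label(["happy", "happy", "neutral", "happy", "neutral"])
model = soft_label(["happy", "neutral", "happy", "happy"])
score = js_divergence(human, model)
```

A hard-label metric would call the model correct here (majority "happy" on both sides) and discard the rest of the information; the distribution-level score also rewards matching the "neutral" minority mass.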
Problem

Research questions and friction points this paper is trying to address.

speech emotion recognition
speech LLMs
zero-shot stochasticity
emotion ambiguity
benchmark evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech LLMs
emotion recognition
soft-label protocol
prompt ensemble
multilingual benchmark
🔎 Similar Papers
2024-09-23 · IEEE International Conference on Acoustics, Speech, and Signal Processing · Citations: 1