🤖 AI Summary
This study addresses two gaps: the lack of a unified evaluation benchmark for speech large language models (LLMs) in emotion recognition, and conventional approaches' neglect of emotion's inherent subjectivity and ambiguity. To this end, the authors construct VoxEmo, a large-scale benchmark spanning 15 languages and 35 corpora, and introduce a distribution-aware soft-labeling protocol alongside multi-level prompting templates that enable evaluations ranging from categorical classification to paralinguistic reasoning. Combining generative interfaces, prompts of varying complexity, and prompt-ensembling strategies, their experiments demonstrate that while zero-shot speech LLMs underperform supervised models in hard-label accuracy, their predicted emotion distributions align more closely with human subjective judgments, effectively capturing the uncertainty intrinsic to emotional expression.
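The prompt-ensembling idea above can be sketched in a few lines: the same clip is queried under several prompt phrasings, each per-prompt prediction is treated like one "annotator", and the votes are pooled into an emotion distribution. This is a minimal illustration under assumed names; the prompt texts, label set, and the `mock_speech_llm` stub are invented for the example and are not VoxEmo's actual templates or interface.

```python
from collections import Counter

# Illustrative prompt variants (assumption, not VoxEmo's real templates).
PROMPTS = [
    "What emotion does the speaker convey?",
    "Classify the emotion in this audio clip.",
    "Based on tone and prosody, how does the speaker feel?",
]

LABELS = ["angry", "happy", "neutral", "sad"]

def mock_speech_llm(clip, prompt):
    """Stand-in for a speech-LLM call; zero-shot outputs vary by prompt."""
    canned = {PROMPTS[0]: "happy", PROMPTS[1]: "happy", PROMPTS[2]: "neutral"}
    return canned[prompt]

def prompt_ensemble_distribution(clip, prompts, labels):
    """Pool one prediction per prompt into a soft emotion distribution."""
    votes = Counter(mock_speech_llm(clip, p) for p in prompts)
    return {lab: votes.get(lab, 0) / len(prompts) for lab in labels}

dist = prompt_ensemble_distribution("clip_001.wav", PROMPTS, LABELS)
print(dist)  # prompt disagreement yields a happy/neutral mixture
```

The key point is that prompt-to-prompt variation, normally a nuisance in zero-shot evaluation, is repurposed here as a proxy for annotator disagreement.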
📝 Abstract
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open-ended text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Moreover, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring prompts of varying complexity, from direct classification to paralinguistic reasoning. To better reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective emotion distributions.
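The distribution-aware soft-label protocol can be illustrated with a small sketch: annotator votes become a probability distribution over emotions, and a model's predicted distribution is scored against it rather than against a single majority label. The label set, the example votes, and the choice of Jensen-Shannon divergence as the distance are assumptions made for this illustration, not the paper's specified metric.

```python
from collections import Counter
import math

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts.get(e, 0) / len(votes) for e in EMOTIONS]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): symmetric, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Five annotators disagree: the clip is ambiguous between happy and neutral.
human = soft_label(["happy", "happy", "neutral", "happy", "neutral"])

# A model that hedges toward the same ambiguity scores better under the
# soft-label view than one that is confidently "happy"-only, even though
# both would be counted correct under majority-vote hard-label accuracy.
hedged = [0.0, 0.55, 0.40, 0.05]
confident = [0.0, 1.00, 0.00, 0.00]
print(js_divergence(human, hedged) < js_divergence(human, confident))  # True
```

This is exactly the regime where the abstract's finding applies: a zero-shot model can lose on hard-label accuracy yet produce distributions closer to how humans actually perceive the clip.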