🤖 AI Summary
Existing evaluation methods for emotional speech captioning struggle to accurately capture fine-grained semantic and emotional detail in long-context scenarios: traditional n-gram metrics disregard semantic content, while large language model (LLM)-based scoring is prone to reasoning inconsistencies and context collapse. To address this, the work proposes EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic-level verification. EmoSURA decomposes captions into atomic perceptual units and, aided by LLM-assisted parsing, validates each unit against the original speech signal through an audio-grounded verification mechanism. Evaluated on SURABench, a balanced, stratified, and standardized benchmark introduced alongside the framework, EmoSURA achieves a positive correlation with human judgments, whereas conventional metrics exhibit negative correlations due to their sensitivity to caption length, making it a more reliable and fine-grained evaluation approach for long-form emotional speech captions.
📝 Abstract
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional n-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements about vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment of long-form captions than traditional metrics, which exhibit negative correlations due to their sensitivity to caption length.
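To make the atomic-verification paradigm concrete, the sketch below shows one way such a pipeline could be wired together. It is a minimal illustration, not EmoSURA's actual implementation: the names `decompose_into_units` and `emosura_style_score` are hypothetical, the sentence-splitting decomposition stands in for LLM-assisted parsing, and the fraction-of-supported-units aggregate is an assumed scoring rule rather than the paper's formula.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicUnit:
    """A self-contained statement about one vocal or emotional attribute."""
    text: str


def decompose_into_units(caption: str) -> List[AtomicUnit]:
    # Stand-in for LLM-assisted parsing: a real pipeline would prompt an LLM
    # to extract atomic perceptual units; here we naively split on sentence
    # boundaries purely for illustration.
    return [AtomicUnit(s.strip()) for s in caption.split(".") if s.strip()]


def emosura_style_score(
    caption: str,
    audio_path: str,
    verify: Callable[[AtomicUnit, str], bool],
) -> float:
    # `verify` stands in for the audio-grounded verifier: it should decide
    # whether a unit's claim is supported by the raw speech signal.
    units = decompose_into_units(caption)
    if not units:
        return 0.0
    supported = sum(verify(unit, audio_path) for unit in units)
    return supported / len(units)


if __name__ == "__main__":
    caption = "The speaker sounds anxious. Her pitch rises sharply at the end."
    # Toy verifier that accepts every unit; a real verifier would query an
    # audio-language model with the waveform and the unit text.
    score = emosura_style_score(caption, "speech.wav", lambda unit, audio: True)
    print(f"Supported-unit ratio: {score:.2f}")
```

Because each unit is judged independently against the audio, the resulting score stays interpretable for long captions, which is the property that holistic LLM scoring tends to lose under context collapse.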