🤖 AI Summary
Existing evaluation methods for emotional speech captioning struggle to accurately capture fine-grained semantic and emotional detail in long-context scenarios: traditional n-gram metrics disregard semantic content, while large language model (LLM)-based scoring is prone to reasoning inconsistencies and context collapse. To address this, the work proposes EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic-level verification. EmoSURA decomposes captions into atomic perceptual units and, aided by LLM-assisted parsing, validates each unit against the original speech signal through an audio-grounded verification mechanism. Evaluated on SURABench, a balanced, stratified, and standardized benchmark introduced alongside the framework, EmoSURA achieves a positive correlation with human judgments, whereas conventional metrics exhibit negative correlations due to their sensitivity to caption length, making it a more reliable and fine-grained evaluation approach for long-form emotional speech captions.
📝 Abstract
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional n-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements about vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment of long-form captions than traditional metrics, which exhibit negative correlations due to their sensitivity to caption length.
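To make the atomic-verification paradigm concrete, the sketch below shows one way such a pipeline could be wired together. It is a minimal illustration, not EmoSURA's actual implementation: the names `decompose_into_units` and `emosura_style_score` are hypothetical, the sentence-splitting decomposition stands in for LLM-assisted parsing, and the fraction-of-supported-units aggregate is an assumed scoring rule rather than the paper's formula.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicUnit:
    """A self-contained statement about one vocal or emotional attribute."""
    text: str


def decompose_into_units(caption: str) -> List[AtomicUnit]:
    # Stand-in for LLM-assisted parsing: a real pipeline would prompt an LLM
    # to extract atomic perceptual units; here we naively split on sentence
    # boundaries purely for illustration.
    return [AtomicUnit(s.strip()) for s in caption.split(".") if s.strip()]


def emosura_style_score(
    caption: str,
    audio_path: str,
    verify: Callable[[AtomicUnit, str], bool],
) -> float:
    # `verify` stands in for the audio-grounded verifier: it should decide
    # whether a unit's claim is supported by the raw speech signal.
    units = decompose_into_units(caption)
    if not units:
        return 0.0
    supported = sum(verify(unit, audio_path) for unit in units)
    return supported / len(units)


if __name__ == "__main__":
    caption = "The speaker sounds anxious. Her pitch rises sharply at the end."
    # Toy verifier that accepts every unit; a real verifier would query an
    # audio-language model with the waveform and the unit text.
    score = emosura_style_score(caption, "speech.wav", lambda unit, audio: True)
    print(f"Supported-unit ratio: {score:.2f}")
```

Because each unit is judged independently against the audio, the resulting score stays interpretable for long captions, which is the property that holistic LLM scoring tends to lose under context collapse.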