🤖 AI Summary
Current spoken dialogue systems lack systematic evaluation of affective reasoning, particularly cross-turn emotional coherence. To address this gap, the authors propose EMO-Reasoning, the first benchmark framework designed specifically to assess emotional coherence in speech-based dialogue. The method introduces a Cross-turn Emotion Reasoning Score and leverages text-to-speech (TTS) synthesis to generate diverse, high-fidelity spoken evaluation data spanning multiple emotion categories and intensity levels. The framework combines three complementary metric types: continuous (e.g., emotion-intensity trajectory), categorical (e.g., polarity consistency), and perceptual (human subjective judgments), enabling multidimensional, reproducible assessment. Experiments across seven state-of-the-art dialogue systems reveal widespread emotional inconsistency, demonstrating the framework's effectiveness and generality in detecting and quantifying emotional-coherence deficits.
📝 Abstract
Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.