On the Emotion Understanding of Synthesized Speech

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited generalization of current speech emotion recognition (SER) models on synthetic speech, which hinders accurate assessment of emotional expressiveness in generated utterances. For the first time, it systematically reveals a representation mismatch in SER models when applied to synthetic speech: these models overly rely on non-robust shortcuts such as textual semantics while neglecting paralinguistic emotional cues. To investigate this issue, the authors integrate both discriminative and generative SER approaches and conduct cross-domain evaluations across multiple datasets using diverse text-to-speech systems. Experimental results demonstrate a significant performance drop of existing SER methods on synthetic speech, underscoring the critical need to model genuine acoustic correlates of emotion. This work establishes a foundational benchmark and provides clear directions for developing more robust emotion recognition systems capable of handling synthetic voices.

📝 Abstract
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and that paralinguistic understanding in SLMs remains challenging.
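The core measurement behind the cross-domain evaluation described above can be sketched as a simple accuracy gap: run the same SER model on human recordings and on synthesized renditions of the same utterances, then compare per-domain accuracy. The snippet below is a minimal, hypothetical illustration of that comparison; the prediction lists are placeholder data, not results from the paper, and in practice they would come from running an actual SER model on each audio file.

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the gold emotion labels."""
    assert len(preds) == len(labels)
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

def domain_gap(human_preds, tts_preds, labels):
    """Accuracy drop when the same SER model is applied to synthesized speech."""
    return accuracy(human_preds, labels) - accuracy(tts_preds, labels)

# Toy example with four utterances (placeholder data):
labels      = ["angry", "happy", "sad", "neutral"]
human_preds = ["angry", "happy", "sad", "happy"]        # 3/4 correct
tts_preds   = ["neutral", "happy", "neutral", "happy"]  # 1/4 correct

print(domain_gap(human_preds, tts_preds, labels))  # prints 0.5
```

A large positive gap on matched utterance sets is the signature of the representation mismatch the paper reports: the model's features transfer poorly from human to synthesized speech even though the emotional content is nominally the same.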
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
Synthesized Speech
Paralinguistic Understanding
Emotion Generalization
Speech Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Emotion Recognition
Synthesized Speech
Representation Mismatch
Paralinguistic Cues
Speech Language Models