The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

📅 2026-04-29

🤖 AI Summary
This study investigates whether widely adopted emotion embedding similarity metrics—such as those based on emotion2vec—genuinely reflect affective expressiveness in speech synthesis evaluation. By constructing adversarial voice samples and conducting human subjective listening experiments, the work reveals for the first time that such metrics in zero-shot emotional speech assessment are highly susceptible to interference from linguistic content and speaker identity, leading to significant divergence from human judgments. The findings demonstrate that emotion embeddings achieving high classification accuracy are ill-suited for similarity-based evaluation, as they tend to reward acoustic mimicry rather than authentic emotional expression. This research issues a critical caution against prevailing automatic evaluation paradigms and points toward more perceptually grounded directions for future benchmarking of emotional speech generation systems.
📝 Abstract
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability shows that the metric rewards acoustic mimicry over genuine emotional synthesis.
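The similarity metric described in the abstract can be sketched in a few lines. The example below is a minimal illustration, not the authors' evaluation code: it uses hand-built toy vectors rather than real emotion2vec outputs, and the helper `toy_embedding`, the split into "emotion" and "speaker" dimensions, and the `speaker_weight` value are all hypothetical assumptions chosen only to show how a non-emotional component could dominate a cosine score.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def toy_embedding(emotion, speaker, speaker_weight=3.0):
    """Hypothetical embedding: emotion axes followed by scaled speaker axes.

    The weight is an assumption that makes speaker information larger in
    magnitude than emotion information; it is not a measured property.
    """
    return list(emotion) + [speaker_weight * s for s in speaker]


# Toy one-hot codes for two emotions and two speakers (illustrative only).
happy, sad = [1.0, 0.0], [0.0, 1.0]
speaker_a, speaker_b = [1.0, 0.0], [0.0, 1.0]

reference = toy_embedding(happy, speaker_a)
# Same speaker as the reference, but the wrong emotion.
mimic = toy_embedding(sad, speaker_a)
# Different speaker, but the same emotion as the reference.
expressive = toy_embedding(happy, speaker_b)

sim_mimic = cosine_similarity(reference, mimic)
sim_expressive = cosine_similarity(reference, expressive)

print(f"same speaker, wrong emotion:  {sim_mimic:.2f}")
print(f"other speaker, right emotion: {sim_expressive:.2f}")
```

Under these assumed weights, the sample that copies the speaker but misses the emotion scores higher than the sample that conveys the right emotion in a different voice, which is the kind of failure mode the abstract attributes to speaker and linguistic interference.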
Problem

Research questions and friction points this paper is trying to address.

emotion embedding
speech generation evaluation
emotional expressiveness
cosine similarity
affective cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion embedding
speech generation evaluation
adversarial analysis
zero-shot similarity
emotional prosody
Yun-Shao Tsai
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Yi-Cheng Lin
National Taiwan University
Speech Processing, Machine Learning, Fairness
Huang-Cheng Chou
Postdoctoral Scholar - NSTC Fellow, USC Viterbi School of Engineering (formerly Amazon and Realtek)
Affective Computing, Spoken Language Understanding, Emotion Recognition, Deception Detection
Tzu-Wen Hsu
Gilbert AI Lab, USA
Yun-Man Hsu
Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan
Chun Wei Chen
Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan
Shrikanth Narayanan
Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, USA
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing