🤖 AI Summary
This study investigates whether widely adopted emotion embedding similarity metrics, such as those based on emotion2vec, genuinely reflect affective expressiveness in speech synthesis evaluation. By constructing adversarial voice samples and running subjective listening experiments with human raters, the work reveals for the first time that, in zero-shot emotional speech assessment, such metrics are highly susceptible to interference from linguistic content and speaker identity, which leads to significant divergence from human judgments. The findings demonstrate that emotion embeddings can achieve high classification accuracy and still be ill-suited for similarity-based evaluation, because they tend to reward acoustic mimicry rather than authentic emotional expression. The research cautions against prevailing automatic evaluation paradigms and points toward more perceptually grounded benchmarking of emotional speech generation systems.
📝 Abstract
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion, which require emotional prosody transfer. To quantify expressiveness, the field widely relies on emotion similarity between reference and generated samples. This approach computes the cosine similarity of embeddings from encoders such as emotion2vec, assuming that the embeddings capture affective cues regardless of linguistic and speaker variation. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite their high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations allow linguistic and speaker interference to overshadow emotional features, which degrades the metric's discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability shows that the metric rewards acoustic mimicry over genuine emotional synthesis.
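The similarity metric described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the embedding extraction step is omitted, and the four-dimensional vectors are hypothetical toy values in which the first two dimensions stand in for emotion and the last two for speaker or linguistic content. That split is an assumption made purely for illustration; real encoder embeddings are not separated this cleanly.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption, not real encoder output):
# dims 0-1 play the role of "emotion", dims 2-3 of "speaker/content".
reference = [1.0, 0.0, 3.0, 3.0]                    # emotion A, speaker S
same_emotion_other_speaker = [1.0, 0.0, -3.0, 3.0]  # emotion A, speaker T
other_emotion_same_speaker = [0.0, 1.0, 3.0, 3.0]   # emotion B, speaker S

sim_same_emotion = cosine_similarity(reference, same_emotion_other_speaker)
sim_same_speaker = cosine_similarity(reference, other_emotion_same_speaker)

print(f"same emotion, different speaker: {sim_same_emotion:.3f}")
print(f"different emotion, same speaker: {sim_same_speaker:.3f}")
```

In this toy setup the sample that shares the speaker but not the emotion scores higher than the sample that shares the emotion but not the speaker, because the non-emotional dimensions carry most of the vector's magnitude. It illustrates the kind of interference the abstract describes and makes no claim about how large the effect is in actual emotion2vec embeddings.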