🤖 AI Summary
Traditional speaker embeddings, optimized for speaker identification, excessively compress intra-speaker variability, leading to inadequate prosody and emotion modeling and reduced naturalness in speech synthesis. To address this, we propose Sub-Center Speaker Embedding (SCSE), the first approach to replace the single center per class with multiple class-specific sub-centers in embedding learning, thereby explicitly modeling speech variability while preserving identification accuracy. Our method integrates a sub-center loss function, a multi-head classification layer, and an end-to-end differentiable speech synthesis or conversion framework. Experiments on voice conversion demonstrate that SCSE improves Mean Opinion Score (MOS) by 0.4 points and increases F0 dynamic range by 23%, markedly enhancing prosodic richness and overall speech naturalness.
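As a concrete illustration of the sub-center loss and multi-head classification layer mentioned above, here is a minimal PyTorch sketch. The name `SubCenterHead`, the default of 3 sub-centers, the cosine scaling factor, and the plain softmax loss are illustrative assumptions, not the paper's exact configuration; pooling per-sub-center similarities with a max mirrors the sub-center ArcFace idea from face recognition.

```python
# Minimal sketch of a sub-center classification head (illustrative, not the
# paper's exact layer). Each of C speakers owns K sub-center weight vectors;
# the logit for a speaker is the best cosine match over its K sub-centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterHead(nn.Module):
    def __init__(self, embed_dim: int, num_speakers: int,
                 num_subcenters: int = 3, scale: float = 30.0):
        super().__init__()
        # One weight vector per (speaker, sub-center) pair: (C, K, D).
        self.weight = nn.Parameter(
            torch.randn(num_speakers, num_subcenters, embed_dim))
        self.scale = scale  # softmax temperature, as in cosine-based losses

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        emb = F.normalize(embeddings, dim=-1)        # (B, D)
        w = F.normalize(self.weight, dim=-1)         # (C, K, D)
        sim = torch.einsum("bd,ckd->bck", emb, w)    # cosine sims, (B, C, K)
        # Max over sub-centers: only the closest sub-center of each speaker
        # has to account for this utterance, so intra-speaker variation
        # need not be compressed into a single center.
        logits, _ = sim.max(dim=-1)                  # (B, C)
        return self.scale * logits

# Training uses ordinary cross-entropy over the pooled logits.
head = SubCenterHead(embed_dim=192, num_speakers=1000)
emb = torch.randn(8, 192)                 # utterance embeddings from an encoder
labels = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(head(emb), labels)
```

Because only the closest sub-center of the true speaker must account for each utterance, the encoder can keep prosodically distinct utterances of one speaker apart in embedding space instead of collapsing them onto a single center.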
📝 Abstract
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Although speaker embeddings are widely used as conditioning inputs in personalized speech synthesis, they are designed to discard intra-speaker variation in order to optimize speaker recognition accuracy. They are therefore suboptimal for speech synthesis, where the rich variation in the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that uses multiple class centers per speaker during speaker classification training, rather than the single class center of traditional embeddings. The proposed approach introduces variation into the speaker embedding while retaining speaker recognition performance, since the model no longer has to map all of a speaker's utterances onto a single class center. We apply the proposed embedding to the voice conversion task and show that it yields better naturalness and prosody in the synthesized speech.
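To make the retained intra-speaker variation concrete, a hypothetical post-training check (reusing the `SubCenterHead` sketch under the AI summary; the speaker index and shapes are made up) can inspect which sub-center each utterance of one speaker aligns with:

```python
import torch
import torch.nn.functional as F
# Reuses the SubCenterHead sketch above; all indices/shapes are illustrative.

head = SubCenterHead(embed_dim=192, num_speakers=1000, num_subcenters=3)
with torch.no_grad():
    utts = F.normalize(torch.randn(4, 192), dim=-1)  # 4 utterances, 1 speaker
    subs = F.normalize(head.weight, dim=-1)[7]       # speaker 7's (K, D) sub-centers
    sim = torch.einsum("bd,kd->bk", utts, subs)      # (4, K) cosine similarities
    print(sim.argmax(dim=-1))  # e.g. tensor([2, 0, 2, 1]): the speaker's
                               # utterances spread across different sub-centers
```

With single-center training all four utterances would be pulled toward the same vector; the spread over sub-centers is exactly the variation the abstract argues is useful as a conditioning signal for synthesis.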