🤖 AI Summary
Traditional speaker embeddings, optimized for speaker identification, excessively compress intra-speaker variability, leading to inadequate prosody and emotion modeling and reduced naturalness in speech synthesis. To address this, we propose Sub-Center Speaker Embedding (SCSE), the first approach to replace the single center per class with multiple class-specific sub-centers in embedding learning, thereby explicitly modeling speech variability while preserving identification accuracy. Our method integrates a sub-center loss function, a multi-head classification layer, and an end-to-end differentiable speech synthesis or conversion framework. Experiments on voice conversion demonstrate that SCSE improves Mean Opinion Score (MOS) by 0.4 points and increases F0 dynamic range by 23%, markedly enhancing prosodic richness and overall speech naturalness.
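As a concrete illustration of the sub-center loss and multi-head classification layer mentioned above, here is a minimal PyTorch sketch. The name `SubCenterHead`, the default of 3 sub-centers, the cosine scaling factor, and the plain softmax loss are illustrative assumptions, not the paper's exact configuration; pooling per-sub-center similarities with a max mirrors the sub-center ArcFace idea from face recognition.

```python
# Minimal sketch of a sub-center classification head (illustrative, not the
# paper's exact layer). Each of C speakers owns K sub-center weight vectors;
# the logit for a speaker is the best cosine match over its K sub-centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterHead(nn.Module):
    def __init__(self, embed_dim: int, num_speakers: int,
                 num_subcenters: int = 3, scale: float = 30.0):
        super().__init__()
        # One weight vector per (speaker, sub-center) pair: (C, K, D).
        self.weight = nn.Parameter(
            torch.randn(num_speakers, num_subcenters, embed_dim))
        self.scale = scale  # softmax temperature, as in cosine-based losses

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        emb = F.normalize(embeddings, dim=-1)        # (B, D)
        w = F.normalize(self.weight, dim=-1)         # (C, K, D)
        sim = torch.einsum("bd,ckd->bck", emb, w)    # cosine sims, (B, C, K)
        # Max over sub-centers: only the closest sub-center of each speaker
        # has to account for this utterance, so intra-speaker variation
        # need not be compressed into a single center.
        logits, _ = sim.max(dim=-1)                  # (B, C)
        return self.scale * logits

# Training uses ordinary cross-entropy over the pooled logits.
head = SubCenterHead(embed_dim=192, num_speakers=1000)
emb = torch.randn(8, 192)                 # utterance embeddings from an encoder
labels = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(head(emb), labels)
```

Because only the closest sub-center of the true speaker must account for each utterance, the encoder can keep prosodically distinct utterances of one speaker apart in embedding space instead of collapsing them onto a single center.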
📝 Abstract
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Although speaker embeddings are widely used as conditioning inputs in personalized speech synthesis, they are designed to discard intra-speaker variation in order to optimize speaker recognition accuracy. They are therefore suboptimal for speech synthesis, where the rich variation in the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that uses multiple class centers per speaker during speaker classification training, rather than the single class center of traditional embeddings. The proposed approach introduces variation into the speaker embedding while retaining speaker recognition performance, since the model no longer has to map all of a speaker's utterances onto a single class center. We apply the proposed embedding to the voice conversion task and show that it yields better naturalness and prosody in the synthesized speech.
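To make the retained intra-speaker variation concrete, a hypothetical post-training check (reusing the `SubCenterHead` sketch under the AI summary; the speaker index and shapes are made up) can inspect which sub-center each utterance of one speaker aligns with:

```python
import torch
import torch.nn.functional as F
# Reuses the SubCenterHead sketch above; all indices/shapes are illustrative.

head = SubCenterHead(embed_dim=192, num_speakers=1000, num_subcenters=3)
with torch.no_grad():
    utts = F.normalize(torch.randn(4, 192), dim=-1)  # 4 utterances, 1 speaker
    subs = F.normalize(head.weight, dim=-1)[7]       # speaker 7's (K, D) sub-centers
    sim = torch.einsum("bd,kd->bk", utts, subs)      # (4, K) cosine similarities
    print(sim.argmax(dim=-1))  # e.g. tensor([2, 0, 2, 1]): the speaker's
                               # utterances spread across different sub-centers
```

With single-center training all four utterances would be pulled toward the same vector; the spread over sub-centers is exactly the variation the abstract argues is useful as a conditioning signal for synthesis.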