We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings

Date: 2024-07-05
Venue: arXiv.org
Citations: 1 (influential: 0)
AI Summary
Traditional speaker embeddings, optimized for speaker identification, excessively compress intra-speaker variability, leading to inadequate prosody and emotion modeling and reduced naturalness in speech synthesis. To address this, we propose Sub-Center Speaker Embedding (SCSE), the first approach to replace single-class centers with multiple class-specific sub-centers in embedding learningโ€”thereby explicitly modeling speech variability while preserving identification accuracy. Our method integrates a sub-center loss function, a multi-head classification layer, and an end-to-end differentiable speech synthesis or conversion framework. Experiments on voice conversion demonstrate that SCSE improves Mean Opinion Score (MOS) by 0.4 points and increases F0 dynamic range by 23%, significantly enhancing prosodic richness and overall speech naturalness.

๐Ÿ“ Abstract
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Although speaker embeddings are widely used as conditioning inputs in personalized speech synthesis, they are designed to discard variation in order to optimize speaker recognition accuracy. They are therefore suboptimal for speech synthesis, where the rich variations of the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that uses multiple class centers per speaker in the speaker classification training, rather than the single class center of traditional embeddings. The proposed approach introduces variation into the speaker embedding while retaining speaker recognition performance, since the model does not have to map all of a speaker's utterances onto a single class center. We apply the proposed embedding to the voice conversion task and show that our method yields better naturalness and prosody in the synthesized speech.
Problem

Research questions and friction points this paper is trying to address.

Modeling rich prosodic variations in human speech
Improving speaker embeddings for speech generation
Capturing intra-speaker variations with sub-center modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple sub-centers per speaker class
Captures broader speaker-specific variations
Improves naturalness and prosodic expressiveness
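The core idea above can be sketched in a few lines: each speaker class gets K sub-centers instead of one, and an utterance's logit for a speaker is its best match against any of that speaker's sub-centers, so different utterances of the same speaker can settle near different sub-centers. This is a minimal NumPy sketch in the spirit of sub-center classification losses; the function names, shapes, and the fixed scale factor are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def subcenter_logits(embeddings, centers):
    """Sub-center classification logits (illustrative sketch).

    embeddings: (B, D) L2-normalized speaker embeddings.
    centers:    (S, K, D) K L2-normalized sub-centers per speaker.
    Returns:    (B, S) logits, the max cosine similarity over sub-centers.
    """
    # Cosine similarity of every embedding to every sub-center: (B, S, K)
    sims = np.einsum("bd,skd->bsk", embeddings, centers)
    # Collapse sub-centers with max: an utterance only needs to be close
    # to ONE sub-center of its speaker, leaving room for intra-speaker
    # (e.g. prosodic) variation instead of compressing it away.
    return sims.max(axis=-1)

def subcenter_loss(logits, labels, scale=30.0):
    """Scaled softmax cross-entropy over speaker classes (assumed form)."""
    z = scale * logits
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    cen = rng.normal(size=(5, 3, 8))          # 5 speakers, 3 sub-centers each
    cen /= np.linalg.norm(cen, axis=2, keepdims=True)
    logits = subcenter_logits(emb, cen)        # (4, 5)
    loss = subcenter_loss(logits, np.array([0, 1, 2, 3]))
    print(logits.shape, float(loss))
```

With K=1 this reduces to an ordinary single-center classification loss, which is what makes the comparison in the paper's experiments meaningful.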
Ismail Rasim Ulgen
Electrical and Computer Engineering Department, University of Texas at Dallas, Richardson, TX 75080 USA
Carlos Busso
Electrical and Computer Engineering Department, University of Texas at Dallas, Richardson, TX 75080 USA
John H. L. Hansen
Electrical and Computer Engineering Department, University of Texas at Dallas, Richardson, TX 75080 USA
Berrak Sisman
Assistant Professor (ECE & DSAI), Johns Hopkins University
Machine Learning · Affective Computing · Speech Synthesis · Voice Conversion · Anti-spoofing