Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting

📅 2024-12-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing few-shot and noisy-condition speaker adaptation TTS approaches suffer from prosodic distortion, timbre degradation, and severe overfitting. To address these issues, this paper proposes a robustness-enhanced framework comprising three key components: (1) a high-quality prior-guided prosody encoder incorporating a novel prosody prompting mechanism; (2) a prior-preserving loss function jointly optimizing prosodic consistency and timbre fidelity; and (3) a diffusion-based architecture integrated with prior-sample distillation for low-resource fine-tuning. Evaluated under extremely challenging conditions—only one minute of target speech and noisy training data—the method achieves a MOS improvement of over 0.5 and a 22% gain in prosody accuracy. It effectively mitigates overfitting and, for the first time, enables concurrent robust modeling of prosody and timbre in few-shot and noisy scenarios.

Technology Category

Application Category

📝 Abstract
Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples to prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even under limited amounts of and noisy target speech samples.
Problem

Research questions and friction points this paper is trying to address.

Adaptive Text-to-Speech
Voice Sample Quality
Speech Synthesis Naturalness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Text-to-Speech
Prior Sample Utilization
Stable Voice Replication
🔎 Similar Papers
No similar papers found.