Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting

📅 2024-12-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing few-shot and noisy-condition speaker adaptation TTS approaches suffer from prosodic distortion, timbre degradation, and severe overfitting. To address these issues, this paper proposes a robustness-enhanced framework comprising three key components: (1) a high-quality prior-guided prosody encoder incorporating a novel prosody prompting mechanism; (2) a prior-preserving loss function jointly optimizing prosodic consistency and timbre fidelity; and (3) a diffusion-based architecture integrated with prior-sample distillation for low-resource fine-tuning. Evaluated under extremely challenging conditions—only one minute of target speech and noisy training data—the method achieves a MOS improvement of over 0.5 and a 22% gain in prosody accuracy. It effectively mitigates overfitting and, for the first time, enables concurrent robust modeling of prosody and timbre in few-shot and noisy scenarios.

Technology Category

Application Category

📝 Abstract

Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples to prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even under limited amounts of and noisy target speech samples.

Problem

Research questions and friction points this paper is trying to address.

Adaptive Text-to-Speech

Voice Sample Quality

Speech Synthesis Naturalness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Text-to-Speech

Prior Sample Utilization

Stable Voice Replication

🔎 Similar Papers

No similar papers found.

Authors to Follow