SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection

📅 2024-08-30

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high model complexity and substantial data and compute requirements of multi-speaker text-to-speech (TTS) for zero-shot synthesis of unseen speakers, this paper proposes SelectTTS, a speaker-embedding-free TTS framework based on frame selection. Its core innovation is to use discrete speech units and frame-level self-supervised features (e.g., from Whisper or WavLM) to retrieve salient frames from a few target-speaker utterances via a non-parametric frame selection mechanism, followed by lightweight autoregressive decoding. Crucially, SelectTTS eliminates all speaker-specific parameters. It achieves competitive speech naturalness (MOS 4.1) and speaker similarity (SIM 0.89), while reducing model parameters by over 8× and training data requirements by 270× compared to prior methods. Its performance matches or surpasses that of XTTS-v2 and VALL-E.

📝 Abstract

Synthesizing the voices of unseen speakers remains a persistent challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A lower-complexity method would make speech synthesis research feasible with limited computational and data resources and open it to wider use. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker's speech, SelectTTS enables generalization to unseen speakers with significantly lower model complexity. Compared to baselines such as XTTS-v2 and VALL-E, SelectTTS achieves better speaker similarity while reducing model parameters by over 8x and training data requirements by 270x.

Problem

Research questions and friction points this paper is trying to address.

Synthesizing voices of unseen speakers in multi-speaker TTS

Reducing model complexity and data requirements for speech synthesis

Improving speaker similarity with lower computational resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discrete unit-based frame selection

Leverages frame-level SSL features

Reduces model parameters by over 8x and training data requirements by 270x relative to baselines such as XTTS-v2 and VALL-E
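As a rough illustration of what discrete unit-based frame selection could look like, the Python sketch below picks, for each target unit, a reference frame from the target speaker that carries the same unit label. This summary does not spell out the paper's actual selection procedure, so everything here is an assumption made for illustration: the function name `select_frames`, the input layout, the nearest-to-previous-frame tie-break, and the fallback for units missing from the reference. The target unit sequence is assumed to be predicted from text by some upstream model, and the decoder that turns the selected features into audio is not shown.

```python
import numpy as np


def select_frames(target_units, ref_units, ref_feats):
    """Pick one reference frame per target unit (illustrative sketch only).

    target_units: sequence of T discrete unit ids (assumed predicted from text).
    ref_units:    sequence of N unit ids, one per reference-speaker frame.
    ref_feats:    (N, D) array of frame-level SSL features for those frames.
    Returns (indices, features): the chosen frame indices, shape (T,), and
    the corresponding features, shape (T, D).
    """
    ref_units = np.asarray(ref_units)
    ref_feats = np.asarray(ref_feats, dtype=float)
    all_frames = np.arange(len(ref_units))

    chosen = []
    for unit in target_units:
        # Candidate frames are those that share the target's unit label.
        candidates = all_frames[ref_units == unit]
        if candidates.size == 0:
            # Assumed fallback: unit never occurs in the reference speech,
            # so every reference frame becomes a candidate.
            candidates = all_frames
        if not chosen:
            # No context yet: take the earliest candidate.
            best = candidates[0]
        else:
            # Assumed continuity heuristic: prefer the candidate whose
            # features are closest to the previously selected frame.
            prev = ref_feats[chosen[-1]]
            dists = np.linalg.norm(ref_feats[candidates] - prev, axis=1)
            best = candidates[np.argmin(dists)]
        chosen.append(int(best))

    indices = np.array(chosen, dtype=int)
    return indices, ref_feats[indices]


if __name__ == "__main__":
    # Toy reference: four frames, two units, 2-dimensional features.
    ref_units = [0, 0, 1, 1]
    ref_feats = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [9.0, 9.0]]
    idx, feats = select_frames([0, 1, 1, 0], ref_units, ref_feats)
    print(idx.tolist(), feats.shape)
```

Because the selection is a lookup over the reference frames rather than a learned speaker-conditioned module, it adds no speaker-specific parameters, which is consistent with the reduced model complexity described above.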
