🤖 AI Summary
To address the challenges of data-intensive annotation requirements and complex training in multi-speaker text-to-speech (TTS) for low-resource languages or domains, this paper proposes SSL-TTS: a lightweight zero-shot multi-speaker TTS framework trainable from speech annotations of only a single speaker. Methodologically, it integrates self-supervised speech representations (wav2vec 2.0), k-nearest-neighbor cross-speaker feature retrieval, and a tunable interpolation-based fine-grained voice blending mechanism, jointly optimizing acoustic modeling and waveform synthesis. Its core contribution is the first demonstration of zero-shot multi-speaker speech synthesis under single-speaker supervision—eliminating the need for multi-speaker annotated data entirely. Objective and subjective evaluations show performance on par with state-of-the-art models relying on large-scale multi-speaker corpora. SSL-TTS significantly lowers the development barrier for TTS in low-resource settings and enables high-fidelity real-time voice cloning and cross-speaker voice mixing.
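The core mechanism described above — retrieving a target speaker's SSL feature frames by nearest-neighbor search and blending them with the source features via a tunable interpolation weight — can be sketched as follows. This is an illustrative reconstruction under assumptions, not the paper's implementation: the function `knn_blend`, the choice of cosine similarity, and the parameters `k` and `lam` are hypothetical stand-ins for whatever retrieval and blending details the authors actually use.

```python
import numpy as np

def knn_blend(source_feats, target_feats, k=4, lam=1.0):
    """Sketch of kNN cross-speaker retrieval with interpolation blending.

    For each source frame, retrieve the k nearest target-speaker frames
    (by cosine similarity) and average them, then linearly interpolate
    between the source frame and the retrieved average:
      lam=1.0 -> fully converted to the target voice,
      lam=0.0 -> original source features unchanged.
    (Hypothetical sketch; the paper's exact retrieval metric and
    blending scheme may differ.)
    """
    # Unit-normalize rows so the dot product equals cosine similarity.
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = s @ t.T                          # (n_src, n_tgt) similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]  # top-k target frames per source frame
    retrieved = target_feats[idx].mean(axis=1)  # average the retrieved neighbors
    return (1.0 - lam) * source_feats + lam * retrieved

# Toy usage: 768-dim frames (a common SSL feature size, assumed here).
src = np.random.randn(10, 768)   # source utterance: 10 frames
tgt = np.random.randn(100, 768)  # target speaker's feature pool
out = knn_blend(src, tgt, k=4, lam=0.5)  # half-blended voice
```

Setting `lam` between 0 and 1 mirrors the "fine-grained voice blending" the summary mentions: intermediate values mix speaker identities rather than fully converting.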
📝 Abstract
While recent zero-shot multi-speaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. It has also been observed that SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity, which enables straightforward and robust voice cloning. In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker. SSL-TTS leverages SSL features and retrieval methods for simple and robust zero-shot multi-speaker synthesis. Objective and subjective evaluations show that our approach achieves performance comparable to state-of-the-art models that require significantly larger training datasets. The low training data requirements mean that SSL-TTS is well suited for developing multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter that enables fine control over the output speech by blending voices. Demo samples are available at https://idiap.github.io/ssl-tts