DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 14
Influential: 1
🤖 AI Summary
To address the scalability limits that phoneme and duration priors impose on zero-shot text-to-speech (TTS), this paper proposes DiTTo-TTS, a large-scale, end-to-end TTS system built on a diffusion-based Transformer (DiT) architecture. DiTTo-TTS eliminates explicit phoneme and duration modeling by leveraging pre-trained cross-modal encoders (mBART for text, Whisper for speech). It aligns text and speech without supervision through cross-attention and semantic guidance in the latent space, and predicts the total speech length to keep generation temporally coherent. Trained on 82K hours of multilingual data, the 790M-parameter model achieves superior or comparable zero-shot performance to state-of-the-art TTS systems in naturalness, intelligibility, and speaker similarity. The approach significantly simplifies training, requiring no linguistic annotations or forced alignment, and speech samples are publicly released.
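The unsupervised alignment described above rests on ordinary cross-attention: noisy speech latents act as queries over frozen text-encoder outputs, so the model learns which text positions each speech frame attends to rather than consuming a forced alignment. A minimal NumPy sketch of that mechanism (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: speech-latent queries attend
    over text-encoder outputs; the attention map plays the role of a
    soft, learned text-speech alignment."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (T_speech, T_text)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text positions
    return weights @ values, weights                # context (T_speech, d), alignment map

rng = np.random.default_rng(0)
T_text, T_speech, d = 12, 40, 64
text_emb = rng.standard_normal((T_text, d))         # stand-in for frozen text-encoder outputs
speech_latents = rng.standard_normal((T_speech, d)) # stand-in for noisy speech latents in the DiT

out, attn = cross_attention(speech_latents, text_emb, text_emb)
print(out.shape, attn.shape)  # → (40, 64) (40, 12)
```

Each row of `attn` is a distribution over text positions for one speech frame, which is exactly the quantity that duration predictors would otherwise have to supply.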

📝 Abstract
Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
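Because the abstract replaces phoneme-level durations with a single total-length prediction, the diffusion model needs to know how many latent frames to denoise before sampling begins. A toy sketch of that step, with untrained placeholder weights (all names and shapes are assumptions for illustration):

```python
import numpy as np

def predict_total_length(text_emb, w, b, max_frames=1000):
    """Toy total-length predictor: pool the text embeddings and regress
    a single frame count, so the noise latent can be allocated at the
    right duration before denoising starts."""
    pooled = text_emb.mean(axis=0)                       # (d,)
    frames = int(np.clip(round(float(pooled @ w + b)), 1, max_frames))
    return frames

rng = np.random.default_rng(0)
d = 64
text_emb = rng.standard_normal((12, d))   # stand-in for text-encoder outputs
w = rng.standard_normal(d) * 0.1          # untrained placeholder weights
b = 300.0                                 # placeholder bias (frames)

n_frames = predict_total_length(text_emb, w, b)
latent = rng.standard_normal((n_frames, d))  # noise latent at the predicted length
print(n_frames, latent.shape)
```

The design point is that one scalar per utterance replaces a per-phoneme duration model, which is what removes the need for forced alignment during training.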
Problem

Research questions and friction points this paper is trying to address.

Scale TTS without domain-specific priors
Eliminate phoneme and duration dependencies
Improve zero-shot naturalness, intelligibility, and speaker similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiTTo-TTS builds zero-shot TTS on a Diffusion Transformer (DiT)
Total-length prediction enables variable-length generation without duration models
Semantic guidance in the latent space improves text-speech alignment
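The semantic-guidance idea can be caricatured as an auxiliary loss that pulls speech latents toward the text encoder's semantic space. The sketch below uses a simple pooled cosine distance as a rough stand-in; the paper's actual objective and pooling are not specified here, so treat every detail as an assumption:

```python
import numpy as np

def semantic_alignment_loss(speech_latents, text_semantics):
    """Toy auxiliary loss: cosine distance between pooled speech latents
    and pooled text-encoder semantics. Minimizing it nudges the speech
    latent space toward the text's semantic content."""
    s = speech_latents.mean(axis=0)
    t = text_semantics.mean(axis=0)
    cos = float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-8))
    return 1.0 - cos  # 0 when perfectly aligned, up to 2 when opposed

rng = np.random.default_rng(1)
loss = semantic_alignment_loss(rng.standard_normal((40, 64)),
                               rng.standard_normal((12, 64)))
print(round(loss, 3))
```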