🤖 AI Summary
This work addresses the challenge, in diffusion-based speech synthesis, of modeling prosody while preserving phonetic distinctiveness. To this end, the authors propose a two-stage curriculum pretraining approach for a speaker-conditioned dual-stream encoder: phonetic contextual representations are first learned via masked language modeling (MLM), and the encoder is then trained with SigLIP-style cross-modal contrastive learning on mixed-phoneme batches. When integrated into both Grad-TTS and a latent diffusion TTS framework, the method yields significant improvements in intelligibility, speaker similarity, and perceptual quality. The experiments further reveal that gains in embedding-space metrics do not necessarily translate into better generation performance: an additional same-phoneme refinement stage improves prosody retrieval but compromises phonetic discriminability and overall synthesis quality.
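As a rough illustration of the mixed-phoneme batching strategy mentioned above, the sketch below builds contrastive batches in which every item carries a different phoneme label, so off-diagonal pairs act as phonetically mismatched negatives. The sampler name and data fields (`mixed_phoneme_batches`, `item["phoneme"]`) are hypothetical and not taken from the paper.

```python
# Illustrative sketch of mixed-phoneme batching: each batch draws items with
# distinct phoneme labels, so every off-diagonal pair in the contrastive loss
# is a phonetically mismatched negative. Field names are assumptions.
import random
from collections import defaultdict


def mixed_phoneme_batches(items, batch_size, seed=0):
    """items: list of dicts with a 'phoneme' label plus encoder inputs.
    Yields batches whose items all carry different phoneme labels."""
    rng = random.Random(seed)
    by_phoneme = defaultdict(list)
    for item in items:
        by_phoneme[item["phoneme"]].append(item)
    for bucket in by_phoneme.values():
        rng.shuffle(bucket)

    # Repeatedly pick batch_size distinct phoneme classes that still have items.
    while True:
        available = [p for p, bucket in by_phoneme.items() if bucket]
        if len(available) < batch_size:
            break
        chosen = rng.sample(available, batch_size)
        yield [by_phoneme[p].pop() for p in chosen]
```

A same-phoneme refinement stage, by contrast, would sample each batch from a single phoneme class, leaving prosodic variation as the only signal separating positives from negatives.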
📝 Abstract
We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
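For reference, a minimal sketch of the SigLIP-style pairwise sigmoid contrastive loss referred to above, assuming PyTorch; the function name, embedding shapes, and temperature/bias initialization are illustrative assumptions, and the paper's speaker conditioning and encoder architecture are not shown.

```python
# Minimal sketch of a SigLIP-style pairwise sigmoid contrastive loss between
# text (phoneme) and audio (prosody) embeddings from the two encoder streams.
import torch
import torch.nn.functional as F


def siglip_contrastive_loss(text_emb, audio_emb, log_temperature, bias):
    """text_emb, audio_emb: (N, D) embeddings from a mixed-phoneme batch,
    so off-diagonal pairs are treated as true negatives."""
    # L2-normalize both streams before computing pairwise similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise logits with learnable temperature and bias, as in SigLIP.
    logits = text_emb @ audio_emb.t() * log_temperature.exp() + bias

    # +1 on the diagonal (matched text-audio pairs), -1 elsewhere.
    n = text_emb.size(0)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0

    # Independent binary (sigmoid) loss over every pair; no softmax over the batch.
    return -F.logsigmoid(labels * logits).mean()


if __name__ == "__main__":
    n, d = 8, 256
    text = torch.randn(n, d)
    audio = torch.randn(n, d)
    log_t = torch.tensor(2.3)   # learnable in practice (temperature ≈ 10)
    b = torch.tensor(-10.0)     # learnable bias, initialized negative in SigLIP
    print(siglip_contrastive_loss(text, audio, log_t, b))
```

Unlike an InfoNCE/softmax objective, each text-audio pair contributes an independent binary term, which is the property the SigLIP formulation is known for.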