🤖 AI Summary
This work addresses the challenge, in diffusion-based speech synthesis, of modeling prosody while preserving phonetic distinctiveness. To this end, the authors propose a two-stage curriculum pretraining approach for a speaker-conditioned dual-stream encoder: phonetic contextual representations are first learned via masked language modeling (MLM), and the encoder is then trained with SigLIP-style cross-modal contrastive learning on mixed-phoneme batches. When integrated into both Grad-TTS and a latent diffusion TTS framework, the method yields significant improvements in intelligibility, speaker similarity, and perceptual quality. The experiments further reveal that gains in embedding-space metrics do not necessarily translate into better generation performance: an additional same-phoneme refinement stage improves prosody retrieval but compromises phonetic discriminability and overall synthesis quality.
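As a rough illustration of the mixed-phoneme batching strategy mentioned above, the sketch below builds contrastive batches in which every item carries a different phoneme label, so off-diagonal pairs act as phonetically mismatched negatives. The sampler name and data fields (`mixed_phoneme_batches`, `item["phoneme"]`) are hypothetical and not taken from the paper.

```python
# Illustrative sketch of mixed-phoneme batching: each batch draws items with
# distinct phoneme labels, so every off-diagonal pair in the contrastive loss
# is a phonetically mismatched negative. Field names are assumptions.
import random
from collections import defaultdict


def mixed_phoneme_batches(items, batch_size, seed=0):
    """items: list of dicts with a 'phoneme' label plus encoder inputs.
    Yields batches whose items all carry different phoneme labels."""
    rng = random.Random(seed)
    by_phoneme = defaultdict(list)
    for item in items:
        by_phoneme[item["phoneme"]].append(item)
    for bucket in by_phoneme.values():
        rng.shuffle(bucket)

    # Repeatedly pick batch_size distinct phoneme classes that still have items.
    while True:
        available = [p for p, bucket in by_phoneme.items() if bucket]
        if len(available) < batch_size:
            break
        chosen = rng.sample(available, batch_size)
        yield [by_phoneme[p].pop() for p in chosen]
```

A same-phoneme refinement stage, by contrast, would sample each batch from a single phoneme class, leaving prosodic variation as the only signal separating positives from negatives.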
📝 Abstract
We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
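For reference, a minimal sketch of the SigLIP-style pairwise sigmoid contrastive loss referred to above, assuming PyTorch; the function name, embedding shapes, and temperature/bias initialization are illustrative assumptions, and the paper's speaker conditioning and encoder architecture are not shown.

```python
# Minimal sketch of a SigLIP-style pairwise sigmoid contrastive loss between
# text (phoneme) and audio (prosody) embeddings from the two encoder streams.
import torch
import torch.nn.functional as F


def siglip_contrastive_loss(text_emb, audio_emb, log_temperature, bias):
    """text_emb, audio_emb: (N, D) embeddings from a mixed-phoneme batch,
    so off-diagonal pairs are treated as true negatives."""
    # L2-normalize both streams before computing pairwise similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise logits with learnable temperature and bias, as in SigLIP.
    logits = text_emb @ audio_emb.t() * log_temperature.exp() + bias

    # +1 on the diagonal (matched text-audio pairs), -1 elsewhere.
    n = text_emb.size(0)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0

    # Independent binary (sigmoid) loss over every pair; no softmax over the batch.
    return -F.logsigmoid(labels * logits).mean()


if __name__ == "__main__":
    n, d = 8, 256
    text = torch.randn(n, d)
    audio = torch.randn(n, d)
    log_t = torch.tensor(2.3)   # learnable in practice (temperature ≈ 10)
    b = torch.tensor(-10.0)     # learnable bias, initialized negative in SigLIP
    print(siglip_contrastive_loss(text, audio, log_t, b))
```

Unlike an InfoNCE/softmax objective, each text-audio pair contributes an independent binary term, which is the property the SigLIP formulation is known for.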