Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in diffusion-based speech synthesis of simultaneously modeling prosody and preserving phonetic distinctiveness. To this end, the authors propose a two-stage curriculum pretraining approach: first, phonetic contextual representations are learned via masked language modeling (MLM); subsequently, a speaker-conditional dual-stream encoder is trained using SigLIP-style cross-modal contrastive learning, augmented with a mixed-phoneme batching strategy. The method yields significant improvements in intelligibility, speaker similarity, and perceptual quality when integrated into both Grad-TTS and latent diffusion TTS frameworks. Experimental results further reveal that enhancements in embedding space metrics do not necessarily translate to better generation performance, and while fine-tuning on homophones improves prosody retrieval, it compromises phonetic discriminability and overall synthesis quality.
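The first stage of the curriculum above is standard masked language modeling over phoneme sequences. As an illustration only (the function name, the `-100` ignore index, and the BERT-style 80/10/10 corruption split are assumptions, not the paper's exact recipe), the masking step might look like:

```python
import numpy as np

def mask_phonemes(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style masking for an MLM pretraining stage (illustrative sketch).

    Returns (corrupted_tokens, targets); targets are -100 (ignored by the
    loss) everywhere except at the positions selected for prediction.
    """
    rng = np.random.default_rng(seed)
    tokens = np.array(token_ids)
    targets = np.full_like(tokens, -100)
    chosen = rng.random(tokens.shape) < mask_prob   # positions to predict
    targets[chosen] = tokens[chosen]
    roll = rng.random(tokens.shape)
    tokens[chosen & (roll < 0.8)] = mask_id                 # 80%: replace with [MASK]
    random_pos = chosen & (roll >= 0.8) & (roll < 0.9)      # 10%: random phoneme
    tokens[random_pos] = rng.integers(0, vocab_size, size=random_pos.sum())
    return tokens, targets                                  # remaining 10%: unchanged
```

The encoder is then trained to recover `targets` from the corrupted sequence, forcing it to learn phonetic context before the contrastive stage begins.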
📝 Abstract
We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
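The second-stage objective is SigLIP-style contrastive learning: unlike CLIP's batch-wide softmax, each text-audio pair in the batch is scored independently with a sigmoid. A minimal numpy sketch of that loss, assuming L2-normalized embeddings and the rough temperature/bias initialization used in SigLIP (not the authors' code or hyperparameters):

```python
import numpy as np

def siglip_loss(text_emb, audio_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid contrastive loss over a batch of text/audio embeddings.

    Matched pairs (the diagonal) get label +1, all other pairs -1; each cell
    of the similarity matrix contributes an independent binary loss term.
    """
    # L2-normalize both streams so logits are scaled cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = temperature * t @ a.T + bias        # (B, B) similarity matrix
    labels = 2.0 * np.eye(len(t)) - 1.0          # +1 on diagonal, -1 elsewhere
    # log(1 + exp(-z * logit)) == -log(sigmoid(z * logit))
    return np.mean(np.log1p(np.exp(-labels * logits)))
```

Under the mixed-phoneme batching strategy, negatives within a batch span different phoneme content, so this loss pushes the encoder toward phoneme discrimination; the paper's finding is that restricting batches to same-phoneme negatives sharpens prosodic retrieval but erodes exactly that discriminability.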
Problem

Research questions and friction points this paper is trying to address.

prosody modeling
text-to-speech synthesis
pretraining
phoneme discrimination
diffusion-based TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked language modeling
cross-modal contrastive learning
prosody-aware TTS
multi-stage pretraining
phoneme discrimination
Kirill Borodin
MTUCI
deep learning for audiogen AI, safe AI
Vasiliy Kudryavtsev
MTUCI
machine learning
Maxim Maslov
MTUCI, Moscow, Russia
Nikita Vasiliev
MTUCI, Moscow, Russia
Mikhail Gorodnichev
MTUCI, Moscow, Russia
Grach Mkrtchian
MTUCI
Artificial Intelligence, Algorithms, Data Structures