Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

To address the high computational cost and slow training convergence of diffusion-based text-to-speech (TTS) models—stemming from implicit intermediate representation learning—this paper proposes a bimodal latent state alignment strategy. Our method jointly introduces text-guided and speech-guided supervision to achieve, for the first time, cross-modal alignment between latent representations and discriminative speech features, thereby significantly reducing reliance on pure diffusion process modeling. The framework integrates context-aware text encoding, a speech semantic refinement module, and end-to-end joint optimization with the diffusion backbone. Experiments demonstrate that our approach accelerates training convergence by 2×, while synthesized speech achieves new state-of-the-art performance: +0.32 MOS (naturalness), −18.7% phoneme error rate (PER), and superior prosodic fidelity compared to existing SOTA baselines.

Technology Category

Application Category

📝 Abstract

The goal of this paper is to optimize the training process of diffusion-based text-to-speech models. While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations. To address this, we propose A-DMA, an effective strategy for Accelerating training with Dual Modality Alignment. Our method introduces a novel alignment pipeline leveraging both text and speech modalities: text-guided alignment, which incorporates contextual representations, and speech-guided alignment, which refines semantic representations. By aligning hidden states with discriminative features, our training scheme reduces the reliance on diffusion models for learning complex representations. Extensive experiments demonstrate that A-DMA doubles the convergence speed while achieving superior performance over baselines. Code and demo samples are available at: https://github.com/ZhikangNiu/A-DMA

Problem

Research questions and friction points this paper is trying to address.

Optimize training of diffusion-based text-to-speech models

Reduce time and computational costs in model training

Align text and speech modalities for faster convergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual modality alignment for faster training

Text and speech guided representation learning

Aligns hidden states with discriminative features

🔎 Similar Papers

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer