M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

📅 2025-12-04

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

Non-autoregressive (NAR) text-to-speech (TTS) suffers from rigid length alignment between text and speech sequences, resulting in low naturalness and suboptimal computational efficiency. This paper proposes M3-TTS, an end-to-end NAR TTS framework built upon a multimodal Diffusion Transformer (DiT). Its core innovation lies in introducing joint diffusion Transformer layers to enforce cross-modal monotonic alignment—eliminating explicit duration modeling and heuristic pseudo-alignment strategies—while employing a single diffusion layer for fine-grained acoustic detail modeling, enabling zero-shot high-fidelity synthesis for the first time in NAR-TTS. Integrated with a mel-spectrogram VAE encoder-decoder and joint diffusion mechanisms, M3-TTS achieves state-of-the-art performance on Seed-TTS and AISHELL-3: word error rates of 1.36% (English) and 1.31% (Chinese), 3× faster training, and significantly improved naturalness over existing approaches.

Technology Category

Application Category

📝 Abstract

Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-vae codec that provides 3* training acceleration. Experimental results on Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.

Problem

Research questions and friction points this paper is trying to address.

Achieving stable monotonic alignment between text and speech sequences

Enhancing acoustic detail modeling without pseudo-alignment constraints

Accelerating training efficiency while maintaining high-fidelity speech synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal diffusion transformer for cross-modal alignment

Mel-vae codec for 3x faster training acceleration

Single diffusion transformer layers enhance acoustic detail modeling

🔎 Similar Papers

AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

2024-09-13arXiv.orgCitations: 1