DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In controllable text-to-speech (TTS), speaker timbre and expressive style are highly entangled, hindering independent control. This paper proposes a disentanglement framework built on a latent-space Diffusion Transformer for fine-grained style control driven jointly by descriptive text and reference audio. Key contributions: (1) a Style-CLAP encoder that aligns acoustic and textual style cues via cross-modal contrastive learning; (2) chained classifier-free guidance (cCFG), enabling independent control over linguistic content, speaker identity, and expressive style; and (3) REPA distillation, which integrates Whisper-derived semantic features with hierarchical conditional dropout to improve training stability and convergence speed. Experiments show significant gains over open-source baselines in style controllability while maintaining high intelligibility and naturalness.
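The Style-CLAP encoder mentioned above is trained with cross-modal contrastive learning. A minimal numpy sketch of a generic CLAP-style symmetric InfoNCE objective follows; this is an illustration, not the paper's exact recipe (the temperature value and the additional multi-task style supervision are omitted, and the embedding networks are assumed to exist upstream):

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text style embeddings.

    audio_emb, text_emb : (batch, dim) arrays where row i of each array
    comes from the same utterance/description pair. Matched pairs sit on
    the diagonal of the similarity matrix; all other rows act as negatives.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # (batch, batch) similarities
    labels = np.arange(len(a))               # matched pairs on the diagonal

    def xent(lg):
        # numerically stable cross-entropy toward the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # average of audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, pulling audio and text style cues into the shared space the summary describes.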

📝 Abstract
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
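The chained classifier-free guidance described in the abstract can be sketched as a telescoping sum of condition increments, each scaled by its own weight. The chaining order (content → timbre → style) and the exact weight parameterisation below are assumptions inferred from the abstract, not the paper's confirmed formulation:

```python
import numpy as np

def chained_cfg(eps_uncond, eps_c, eps_ct, eps_cts,
                w_content, w_timbre, w_style):
    """Chained classifier-free guidance (cCFG) sketch.

    eps_uncond : model prediction with all conditions dropped
    eps_c      : conditioned on content only
    eps_ct     : conditioned on content + timbre
    eps_cts    : conditioned on content + timbre + style

    Each weight scales one link of the chain, so the strength of
    content, timbre, and style guidance can be adjusted independently
    at inference time. Hierarchical condition dropout during training
    is what makes all four predictions available from one model.
    """
    return (eps_uncond
            + w_content * (eps_c - eps_uncond)
            + w_timbre * (eps_ct - eps_c)
            + w_style * (eps_cts - eps_ct))
```

With all weights set to 1 the chain collapses to the fully conditioned prediction; raising `w_style` alone amplifies only the style increment, which is the independent-adjustment property the abstract claims.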
Problem

Research questions and friction points this paper is trying to address.

Achieving independent control of speaker timbre and speaking style in TTS
Enabling fine-grained adjustment of content, timbre, and style attributes
Improving style controllability without degrading speech intelligibility or naturalness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Diffusion Transformer framework with explicit attribute disentanglement and multi-modal prompting
Chained classifier-free guidance (cCFG) for independent adjustment of content, timbre, and style
Representation Alignment (REPA) distilling Whisper acoustic-semantic features into intermediate DiT representations
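The REPA contribution above can be illustrated as an alignment loss between projected DiT hidden states and frozen Whisper encoder features. This numpy sketch uses a cosine-similarity objective; the choice of DiT layer, the linear projector shape, and the cosine form are assumptions for illustration:

```python
import numpy as np

def repa_alignment_loss(dit_hidden, whisper_feat, proj):
    """REPA-style distillation loss sketch.

    dit_hidden   : (T, d_dit) intermediate DiT representations
    whisper_feat : (T, d_whisper) frozen Whisper encoder features
    proj         : (d_dit, d_whisper) learned linear projector

    Projects the DiT states into the Whisper feature space and
    penalises low per-frame cosine similarity (loss = 1 - mean cosine),
    pushing the diffusion backbone toward acoustic-semantic structure.
    """
    h = dit_hidden @ proj
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    w = whisper_feat / np.linalg.norm(whisper_feat, axis=1, keepdims=True)
    return 1.0 - np.mean(np.sum(h * w, axis=1))
```

The loss reaches 0 when the projected states point in the same direction as the Whisper features, which is the stabilising signal the summary credits for faster convergence.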