🤖 AI Summary
In controllable text-to-speech (TTS), speaker timbre and expressive style are highly entangled, which hinders independent control. This paper proposes DMP-TTS, a disentanglement framework built on a latent Diffusion Transformer (DiT) that supports fine-grained style control driven by descriptive text and reference audio. Key contributions include: (1) Style-CLAP, a CLAP-based style encoder that aligns style cues from audio and text in a shared space through cross-modal contrastive learning and multi-task supervision on style attributes; (2) chained classifier-free guidance (cCFG), trained with hierarchical condition dropout, which lets the guidance strengths for linguistic content, speaker timbre, and expressive style be adjusted independently at inference; and (3) Representation Alignment (REPA), which distills acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations to stabilize training and accelerate convergence. Experiments show stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness.
📝 Abstract
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
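The abstract does not give the cCFG formula, so the following is only a minimal sketch of one common way to chain classifier-free guidance terms, not the authors' implementation. It assumes a nested ordering (content, then timbre, then style), four denoiser outputs per step, and illustrative dropout probabilities. All function names, parameter names, and default values here are hypothetical.

```python
import random


def chained_cfg(eps_null, eps_c, eps_ct, eps_cts, w_content, w_timbre, w_style):
    """Combine four denoiser outputs using one guidance scale per condition.

    eps_null: prediction with no conditions
    eps_c:    prediction with content only
    eps_ct:   prediction with content + timbre
    eps_cts:  prediction with content + timbre + style
    The nesting order is an assumption, not taken from the paper.
    """
    return [
        n + w_content * (c - n) + w_timbre * (ct - c) + w_style * (cts - ct)
        for n, c, ct, cts in zip(eps_null, eps_c, eps_ct, eps_cts)
    ]


def sample_condition_mask(rng, p_drop_all=0.1, p_drop_timbre=0.1, p_drop_style=0.1):
    """Hierarchical dropout sketch: dropping a condition also drops those chained after it.

    Returns which conditions are kept for one training example.
    Probabilities are placeholders, not values from the paper.
    """
    u = rng.random()
    if u < p_drop_all:
        return {"content": False, "timbre": False, "style": False}
    if u < p_drop_all + p_drop_timbre:
        return {"content": True, "timbre": False, "style": False}
    if u < p_drop_all + p_drop_timbre + p_drop_style:
        return {"content": True, "timbre": True, "style": False}
    return {"content": True, "timbre": True, "style": True}


# Toy one-dimensional "predictions" to show how each scale acts on its own term.
neutral = chained_cfg([0.0], [1.0], [2.0], [4.0], 1.0, 1.0, 1.0)
more_style = chained_cfg([0.0], [1.0], [2.0], [4.0], 1.0, 1.0, 2.0)
```

With every scale set to 1 the combination reduces to the fully conditioned prediction. Raising only `w_style` scales just the style difference term, which is the sense in which the three strengths can be adjusted independently under this assumed form. The nested dropout pattern ensures the model sees, during training, exactly the four condition subsets the inference-time formula queries.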