🤖 AI Summary
Existing TTS systems struggle with fine-grained, time-varying emotional control and typically require large-scale emotional data for fine-tuning—compromising zero-shot voice cloning capability and speech naturalness. To address this, we propose the first integration of ControlNet into a flow-matching TTS framework, introducing a lightweight, plug-and-play conditional control architecture that avoids fine-tuning the backbone model. Specifically, we freeze the pre-trained large model and train only a compact, learnable copy network to process time-aligned emotional conditions. Structural optimization via block-wise analysis enables emotion-specific flow steps and intensity-controllable modulation. Our method preserves the original model’s zero-shot cloning ability and speech naturalness while significantly enhancing emotional controllability. Evaluated on Emo-SIM and Aro-Val SIM metrics, it achieves state-of-the-art performance, demonstrating effectiveness, modularity, and deployment flexibility.
📝 Abstract
Recent advances in text-to-speech (TTS) have enabled natural speech synthesis, but fine-grained, time-varying emotion control remains challenging. Existing methods often allow only utterance-level control and require full model fine-tuning on a large emotional speech dataset, which can degrade performance. Inspired by ControlNet (Zhang et al., 2023), which adds conditional control to an existing model, we propose the first ControlNet-based approach for controllable flow-matching TTS (TTS-CtrlNet), which freezes the original model and introduces a trainable copy of it to process additional conditions. We show that TTS-CtrlNet can boost a pretrained large TTS model by adding intuitive, scalable, and time-varying emotion control while inheriting the abilities of the original model (e.g., zero-shot voice cloning and naturalness). Furthermore, we provide practical recipes for adding emotion control: 1) an optimal architectural design choice via block-wise analysis, 2) an emotion-specific flow step, and 3) a flexible control scale.
Experiments show that our method effectively adds an emotion controller to an existing TTS model and achieves state-of-the-art performance on emotion similarity metrics (Emo-SIM and Aro-Val SIM). The project page is available at: https://curryjung.github.io/ttsctrlnet_project_page
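The core ControlNet-style idea described above — a frozen pretrained backbone plus a trainable copy whose output enters through a zero-initialized projection, scaled by a user-set control strength — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the block shapes, names (`backbone_block`, `control_branch`, `control_scale`), and the use of a plain matrix in place of a real flow-matching transformer block are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical feature dimension for illustration

# Frozen pretrained "backbone" block: its weights are never updated.
W_frozen = rng.standard_normal((DIM, DIM))

def backbone_block(x):
    """Stand-in for one frozen block of the pretrained flow-matching TTS model."""
    return np.tanh(x @ W_frozen)

# Trainable copy: initialized from the backbone weights, followed by a
# zero-initialized output projection ("zero conv" in ControlNet terms),
# so the control branch contributes nothing at the start of training.
W_copy = W_frozen.copy()              # trainable in a real implementation
W_cond = rng.standard_normal((DIM, DIM))  # projects the emotion condition
W_zero = np.zeros((DIM, DIM))         # zero-initialized output projection

def control_branch(x, emotion_cond):
    """Trainable copy that additionally consumes a time-aligned emotion condition."""
    h = np.tanh((x + emotion_cond @ W_cond) @ W_copy)
    return h @ W_zero

def controlled_block(x, emotion_cond, control_scale=1.0):
    """Frozen backbone output plus the scaled control-branch residual.

    control_scale is the flexible, user-set emotion intensity knob.
    """
    return backbone_block(x) + control_scale * control_branch(x, emotion_cond)

x = rng.standard_normal((1, DIM))
cond = rng.standard_normal((1, DIM))

# At initialization the zero projection makes the controlled output identical
# to the frozen backbone's output, which is why zero-shot cloning ability and
# naturalness are preserved before (and, by design, throughout) training.
assert np.allclose(controlled_block(x, cond), backbone_block(x))
```

Setting `control_scale=0.0` recovers the original model exactly, which is the mechanism behind the plug-and-play, intensity-controllable modulation claimed in the summary.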