🤖 AI Summary
Current controllable TTS models struggle with fine-grained, multi-attribute control (e.g., emotion, timbre) from natural language prompts and are hindered by the scarcity of high-quality annotated data. To address this, the authors propose a two-stage style-controllable TTS framework built on language models, using a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer generates these style-rich tokens conditioned on text and control signals; in the second stage, codec tokens are generated from both the text and the sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets improves the content robustness of the two-stage system and its control over multiple attributes. By selectively combining discrete labels and speaker embeddings, the system can fully control a speaker's timbre and other stylistic information, or adjust attributes such as emotion for a specified speaker. Audio samples are publicly available.
📝 Abstract
Controllable TTS models driven by natural language prompts often lack fine-grained control and face a scarcity of high-quality data. We propose a two-stage style-controllable TTS system with language models, utilizing a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer conditionally generates these style-rich tokens from text and control signals. The second stage generates codec tokens from both the text and the sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets enhances the content robustness of the two-stage model as well as its control over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker's timbre and other stylistic information, and adjusting attributes like emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.
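The two-stage pipeline in the abstract can be sketched as below. The function names, token vocabulary sizes (1024 style tokens, 4096 codec tokens), and the token arithmetic are all illustrative assumptions, not the paper's implementation; each stand-in function marks where an autoregressive transformer would run.

```python
# Hedged sketch of the abstract's two-stage pipeline.
# All functions are hypothetical toy stand-ins for the paper's models;
# the arithmetic only illustrates the data flow between the stages.

def stage1_style_lm(text_tokens, control_prompt):
    """Stage 1 (stand-in): autoregressively generate style-rich tokens
    conditioned on text and a natural-language control prompt."""
    ctrl = sum(ord(c) for c in control_prompt)  # toy prompt encoding
    return [(t * 7 + ctrl) % 1024 for t in text_tokens]

def stage2_codec_lm(text_tokens, style_tokens):
    """Stage 2 (stand-in): generate codec tokens from the text and the
    sampled style-rich tokens; a codec decoder would render them to audio."""
    return [(t + s) % 4096 for t, s in zip(text_tokens, style_tokens)]

def synthesize(text_tokens, control_prompt):
    """Full pipeline: text + control prompt -> style tokens -> codec tokens."""
    style_tokens = stage1_style_lm(text_tokens, control_prompt)
    return stage2_codec_lm(text_tokens, style_tokens)

codec_tokens = synthesize([3, 17, 42], "calm male voice")
```

Separating the stages this way is what lets the style-rich intermediary be sampled or swapped independently of the text, which is the mechanism behind the cross-attribute control the abstract describes.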