🤖 AI Summary
Current controllable TTS models struggle with fine-grained, multi-attribute control (e.g., emotion, timbre) from natural language prompts and are hindered by the scarcity of high-quality annotated data. To address this, the authors propose a two-stage style-controllable TTS framework built on language models, using a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer generates these style-rich tokens conditioned on text and control signals; in the second stage, codec tokens are generated from both the text and the sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets improves the content robustness of the two-stage system and its control over multiple attributes. By selectively combining discrete labels and speaker embeddings, the system can fully control a speaker's timbre and other stylistic information, or adjust attributes such as emotion for a specified speaker. Audio samples are publicly available.
📝 Abstract
Controllable TTS models driven by natural language prompts often lack fine-grained control and face a scarcity of high-quality data. We propose a two-stage style-controllable TTS system with language models, utilizing a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer conditionally generates these style-rich tokens from text and control signals. The second stage generates codec tokens from both the text and the sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets enhances the content robustness of the two-stage model as well as its control over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker's timbre and other stylistic information, and adjusting attributes like emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.
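The two-stage pipeline in the abstract can be sketched as below. The function names, token vocabulary sizes (1024 style tokens, 4096 codec tokens), and the token arithmetic are all illustrative assumptions, not the paper's implementation; each stand-in function marks where an autoregressive transformer would run.

```python
# Hedged sketch of the abstract's two-stage pipeline.
# All functions are hypothetical toy stand-ins for the paper's models;
# the arithmetic only illustrates the data flow between the stages.

def stage1_style_lm(text_tokens, control_prompt):
    """Stage 1 (stand-in): autoregressively generate style-rich tokens
    conditioned on text and a natural-language control prompt."""
    ctrl = sum(ord(c) for c in control_prompt)  # toy prompt encoding
    return [(t * 7 + ctrl) % 1024 for t in text_tokens]

def stage2_codec_lm(text_tokens, style_tokens):
    """Stage 2 (stand-in): generate codec tokens from the text and the
    sampled style-rich tokens; a codec decoder would render them to audio."""
    return [(t + s) % 4096 for t, s in zip(text_tokens, style_tokens)]

def synthesize(text_tokens, control_prompt):
    """Full pipeline: text + control prompt -> style tokens -> codec tokens."""
    style_tokens = stage1_style_lm(text_tokens, control_prompt)
    return stage2_codec_lm(text_tokens, style_tokens)

codec_tokens = synthesize([3, 17, 42], "calm male voice")
```

Separating the stages this way is what lets the style-rich intermediary be sampled or swapped independently of the text, which is the mechanism behind the cross-attribute control the abstract describes.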