🤖 AI Summary
This work proposes the Qwen3-TTS series of models to address key challenges in multilingual, low-latency, high-fidelity speech synthesis—namely voice cloning, controllability, and streaming generation. The core innovations include a dual-track language model architecture enabling real-time streaming synthesis and two novel speech tokenizers: a semantics-guided Qwen-TTS-Tokenizer-25Hz and an ultra-low-latency multi-codebook Qwen-TTS-Tokenizer-12Hz (operating at 12.5 Hz). Integrated with chunk-wise DiT-based waveform reconstruction and a lightweight causal ConvNet, the system achieves a 97 ms first-packet latency, 3-second voice cloning, support for 10 languages, and fine-grained description-driven control. It attains state-of-the-art performance on multilingual TTS benchmarks, InstructTTSEval, and long-form speech evaluation. Both the models and tokenizers are publicly released under the Apache 2.0 license.
📝 Abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., the TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
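The quoted frame rates directly determine how many autoregressive steps the LM must take per second of speech, and how much audio a single frame covers. A back-of-envelope sketch (the helper names are illustrative, not from the report):

```python
def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of audio covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz

def frames_needed(seconds: float, frame_rate_hz: float) -> float:
    """Autoregressive steps the LM emits for a given speech duration."""
    return seconds * frame_rate_hz

# Qwen-TTS-Tokenizer-25Hz: 40 ms per frame, 250 frames for 10 s of speech.
print(frame_duration_ms(25.0), frames_needed(10.0, 25.0))   # 40.0 250.0

# Qwen-TTS-Tokenizer-12Hz (12.5 Hz): 80 ms per frame, i.e. half the
# sequence length for the same audio. A 97 ms first-packet latency is
# thus on the order of a single frame plus reconstruction overhead.
print(frame_duration_ms(12.5), frames_needed(10.0, 12.5))   # 80.0 125.0
```

Note that the 12.5 Hz tokenizer's 16-layer multi-codebook design means each frame carries multiple code indices, so the halved frame rate trades sequence length against per-frame payload rather than reducing both.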