Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

πŸ“… 2025-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the inefficiency, limited (coarse-grained-only) controllability, and integration challenges arising from multi-stage modeling in zero-shot text-to-speech (TTS), this paper proposes a single-stream, disentangled speech tokenization framework. The core contribution is BiCodec, a novel single-stream speech codec that, for the first time, explicitly disentangles semantic tokens (low-bitrate, temporally aligned) from speaker tokens (fixed-length, globally aggregated). Integrated with the Qwen2.5 large language model and chain-of-thought (CoT) generation, the framework enables joint control over speaking style and fine-grained acoustic parameters (e.g., pitch, speaking rate). Trained on the 100,000-hour VoxBox dataset, the method achieves state-of-the-art zero-shot voice cloning, significantly outperforming prior approaches in controllability, naturalness, and cross-speaker generalization. The code, models, and audio samples are publicly released.

πŸ“ Abstract
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
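The abstract's central idea is that BiCodec encodes speech into two complementary streams: a time-aligned, low-bitrate semantic stream (linguistic content) and a fixed-length global stream (speaker attributes). The toy sketch below illustrates that interface shape only; all names, rates, and codebook sizes are illustrative assumptions, not the actual BiCodec API or configuration.

```python
import random

# Assumed, illustrative constants (not the paper's actual values):
FRAME_RATE_HZ = 50       # semantic tokens per second (low-bitrate stream)
NUM_GLOBAL_TOKENS = 32   # fixed-length speaker-token count

def bicodec_encode(waveform, sample_rate=16000):
    """Toy stand-in for a BiCodec-style encoder.

    Returns (semantic_tokens, global_tokens):
      - semantic_tokens: one token per frame, length grows with the utterance
      - global_tokens: pooled over the whole utterance, length is fixed
    """
    num_frames = max(1, len(waveform) * FRAME_RATE_HZ // sample_rate)
    # Random IDs stand in for the real quantizer outputs.
    semantic_tokens = [random.randrange(8192) for _ in range(num_frames)]
    global_tokens = [random.randrange(4096) for _ in range(NUM_GLOBAL_TOKENS)]
    return semantic_tokens, global_tokens

# One second of audio -> ~50 semantic tokens, but always 32 global tokens,
# regardless of utterance length: the disentanglement the paper relies on.
sem, glob = bicodec_encode([0.0] * 16000)
print(len(sem), len(glob))
```

Because the speaker stream has fixed length, an LLM can treat it as a short conditioning prefix while autoregressively generating only the semantic stream, which is what makes a single-stream (single-codebook-per-step) decoding loop possible.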
Problem

Research questions and friction points this paper is trying to address.

Multi-stage modeling makes zero-shot TTS pipelines inefficient and hard to integrate
Existing systems offer only coarse-grained control over speech attributes
Controllable-TTS research lacks a large dataset with comprehensive attribute annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

BiCodec, a single-stream speech codec with disentangled semantic and global speaker tokens
Qwen2.5 LLM with chain-of-thought (CoT) generation for attribute control
VoxBox, a 100,000-hour attribute-annotated dataset for controllable TTS
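The CoT generation idea in the bullets above can be pictured as an ordered prompt: the LLM first commits to coarse attribute labels, then to fine-grained acoustic values, and only then emits the speech tokens. The layout below is a hypothetical illustration of that ordering; the tag names and fields are invented for the sketch and are not the paper's actual token format.

```python
def build_cot_prompt(text, gender="female", pitch_hz=220, speed=1.1):
    """Illustrative CoT-style prompt: coarse attributes before fine ones,
    both before the speech tokens the model must ultimately generate."""
    return (
        f"<text>{text}</text>"
        # Step 1: coarse, human-readable attribute labels (predicted first).
        f"<attr>gender={gender}|pitch=moderate|speed=moderate</attr>"
        # Step 2: fine-grained numeric values, conditioned on the labels.
        f"<fine>pitch_hz={pitch_hz}|rate={speed}</fine>"
        # Step 3: the model continues from here with semantic speech tokens.
        "<semantic>"
    )

prompt = build_cot_prompt("Hello world")
print(prompt)
```

Ordering coarse decisions before fine ones is what lets a single model serve both use cases: a user can specify only the coarse labels and let the model infer plausible fine values, or pin down exact pitch and rate directly.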
πŸ‘₯ Authors
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
speech synthesis, singing voice synthesis, voice conversion
Mingqi Jiang
SparkAudio Open Source Community, Shanghai Mobvoi Information Technology Co., Ltd
Ziyang Ma
Shanghai Jiao Tong University, Nanyang Technological University
Ziyu Zhang
ASLP@NPU, Northwestern Polytechnical University
Songxiang Liu
Meituan multi-modal team, PhD (The Chinese University of Hong Kong)
Multi-Modal, LLM, Audio foundation model, Speech synthesis
Linqin Li
Shanghai Mobvoi Information Technology Co., Ltd
Zheng Liang
Shanghai Jiao Tong University
Qixi Zheng
Shanghai Jiao Tong University
voice conversion, text-to-speech, diffusion models, flow matching
Rui Wang
Shanghai Mobvoi Information Technology Co., Ltd
Xiaoqin Feng
University of Southern California
LLM/Agent/Application/Data/Evaluation
Weizhen Bian
Hong Kong University of Science and Technology
Zhen Ye
Hong Kong University of Science and Technology
Sitong Cheng
Hong Kong University of Science and Technology
Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music
Zhixian Zhao
Northwestern Polytechnical University
Emotion Speech Recognition, Understanding and Generation
Xinfa Zhu
Northwestern Polytechnical University
speech generation
Jiahao Pan
Hong Kong University of Science and Technology
Speech Processing, Speech Enhancement, Music Generation
Liumeng Xue
Hong Kong University of Science and Technology
Audio Speech and Language Processing, Speech Generation
Pengcheng Zhu
Fuxi AI Lab, NetEase Inc.
speech synthesis, singing voice synthesis, talking avatar, voice conversion
Yunlin Chen
Mobvoi
speech, avatar
Zhifei Li
Research Scientist at Google
machine translation, natural language processing, machine learning, wireless networks
Xie Chen
ASLP@NPU, Northwestern Polytechnical University
Lei Xie
ASLP@NPU, Northwestern Polytechnical University
Yike Guo
Hong Kong University of Science and Technology
Wei Xue
Hong Kong University of Science and Technology