Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

πŸ“… 2025-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the inefficiency, limited (coarse-grained-only) controllability, and integration challenges arising from multi-stage modeling in zero-shot text-to-speech (TTS), this paper proposes a single-stream, disentangled speech tokenization framework. The core contribution is BiCodec, a novel single-stream speech codec that, for the first time, explicitly disentangles semantic tokens (low-bitrate, temporally aligned) from speaker tokens (fixed-length, globally aggregated). Integrated with the Qwen2.5 large language model and chain-of-thought (CoT) generation, the framework enables joint control over speaking style and fine-grained acoustic parameters (e.g., pitch, speaking rate). Trained on the 100,000-hour VoxBox dataset, the method achieves state-of-the-art zero-shot voice cloning, significantly outperforming prior approaches in controllability, naturalness, and cross-speaker generalization. The code, models, and audio samples are publicly released.

πŸ“ Abstract
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
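The abstract's central idea is that BiCodec encodes speech into two complementary streams: a time-aligned, low-bitrate semantic stream (linguistic content) and a fixed-length global stream (speaker attributes). The toy sketch below illustrates that interface shape only; all names, rates, and codebook sizes are illustrative assumptions, not the actual BiCodec API or configuration.

```python
import random

# Assumed, illustrative constants (not the paper's actual values):
FRAME_RATE_HZ = 50       # semantic tokens per second (low-bitrate stream)
NUM_GLOBAL_TOKENS = 32   # fixed-length speaker-token count

def bicodec_encode(waveform, sample_rate=16000):
    """Toy stand-in for a BiCodec-style encoder.

    Returns (semantic_tokens, global_tokens):
      - semantic_tokens: one token per frame, length grows with the utterance
      - global_tokens: pooled over the whole utterance, length is fixed
    """
    num_frames = max(1, len(waveform) * FRAME_RATE_HZ // sample_rate)
    # Random IDs stand in for the real quantizer outputs.
    semantic_tokens = [random.randrange(8192) for _ in range(num_frames)]
    global_tokens = [random.randrange(4096) for _ in range(NUM_GLOBAL_TOKENS)]
    return semantic_tokens, global_tokens

# One second of audio -> ~50 semantic tokens, but always 32 global tokens,
# regardless of utterance length: the disentanglement the paper relies on.
sem, glob = bicodec_encode([0.0] * 16000)
print(len(sem), len(glob))
```

Because the speaker stream has fixed length, an LLM can treat it as a short conditioning prefix while autoregressively generating only the semantic stream, which is what makes a single-stream (single-codebook-per-step) decoding loop possible.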
Problem

Research questions and friction points this paper is trying to address.

Multi-stage modeling makes zero-shot TTS pipelines inefficient and hard to integrate
Existing systems offer only coarse-grained control over speech attributes
Controllable-TTS research lacks a large dataset with comprehensive attribute annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

BiCodec, a single-stream speech codec with disentangled semantic and global speaker tokens
Qwen2.5 LLM with chain-of-thought (CoT) generation for attribute control
VoxBox, a 100,000-hour attribute-annotated dataset for controllable TTS
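The CoT generation idea in the bullets above can be pictured as an ordered prompt: the LLM first commits to coarse attribute labels, then to fine-grained acoustic values, and only then emits the speech tokens. The layout below is a hypothetical illustration of that ordering; the tag names and fields are invented for the sketch and are not the paper's actual token format.

```python
def build_cot_prompt(text, gender="female", pitch_hz=220, speed=1.1):
    """Illustrative CoT-style prompt: coarse attributes before fine ones,
    both before the speech tokens the model must ultimately generate."""
    return (
        f"<text>{text}</text>"
        # Step 1: coarse, human-readable attribute labels (predicted first).
        f"<attr>gender={gender}|pitch=moderate|speed=moderate</attr>"
        # Step 2: fine-grained numeric values, conditioned on the labels.
        f"<fine>pitch_hz={pitch_hz}|rate={speed}</fine>"
        # Step 3: the model continues from here with semantic speech tokens.
        "<semantic>"
    )

prompt = build_cot_prompt("Hello world")
print(prompt)
```

Ordering coarse decisions before fine ones is what lets a single model serve both use cases: a user can specify only the coarse labels and let the model infer plausible fine values, or pin down exact pitch and rate directly.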
πŸ‘₯ Authors
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
speech synthesis, singing voice synthesis, voice conversion
Mingqi Jiang
SparkAudio Open Source Community, Shanghai Mobvoi Information Technology Co., Ltd
Ziyang Ma
Shanghai Jiao Tong University, Nanyang Technological University
Ziyu Zhang
ASLP@NPU, Northwestern Polytechnical University
Songxiang Liu
Meituan multi-modal team, PhD (The Chinese University of Hong Kong)
Multi-Modal, LLM, Audio foundation model, Speech synthesis
Linqin Li
Shanghai Mobvoi Information Technology Co., Ltd
Zheng Liang
Shanghai Jiao Tong University
Qixi Zheng
Shanghai Jiao Tong University
voice conversion, text-to-speech, diffusion models, flow matching
Rui Wang
Shanghai Mobvoi Information Technology Co., Ltd
Xiaoqin Feng
University of Southern California
LLM/Agent/Application/Data/Evaluation
Weizhen Bian
Hong Kong University of Science and Technology
Zhen Ye
Hong Kong University of Science and Technology
Sitong Cheng
Hong Kong University of Science and Technology
Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music
Zhixian Zhao
Northwestern Polytechnical University
Emotion Speech Recognition, Understanding and Generation
Xinfa Zhu
Northwestern Polytechnical University
speech generation
Jiahao Pan
Hong Kong University of Science and Technology
Speech Processing, Speech Enhancement, Music Generation
Liumeng Xue
Hong Kong University of Science and Technology
Audio Speech and Language Processing, Speech Generation
Pengcheng Zhu
Fuxi AI Lab, NetEase Inc.
speech synthesis, singing voice synthesis, talking avatar, voice conversion
Yunlin Chen
Mobvoi
speech, avatar
Zhifei Li
Research Scientist at Google
machine translation, natural language processing, machine learning, wireless networks
Xie Chen
ASLP@NPU, Northwestern Polytechnical University
Lei Xie
ASLP@NPU, Northwestern Polytechnical University
Yike Guo
Hong Kong University of Science and Technology
Wei Xue
Hong Kong University of Science and Technology