CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

📅 2025-05-23
🤖 AI Summary
CosyVoice 2 advances low-latency bi-streaming speech synthesis and audio quality, but suffers from narrow language coverage, limited domain diversity, small-scale training data, poor robustness to diverse text formats, and insufficient post-training techniques. To enable zero-shot multilingual speech synthesis in real-world scenarios, CosyVoice 3 introduces three key innovations: (1) a novel multi-task supervised speech tokenizer; (2) a transferable, differentiable prosody and alignment reward model; and (3) joint scaling of a million-hour multilingual and multidialectal dataset with a 1.5B-parameter model, integrating chunk-aware streaming flow matching, LLM-based modeling, and reinforcement-based post-training. Evaluated across 9 languages and 18 dialects, CosyVoice 3 achieves human-level audio quality and significantly improves cross-lingual zero-shot synthesis in prosodic naturalness, speaker fidelity, and text–speech alignment.

πŸ“ Abstract
In our prior work, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training, applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset size scaling: training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model size scaling: model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
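The abstract describes a two-stage pipeline: an LLM first produces discrete speech tokens, and a chunk-aware flow-matching model then converts them to audio chunk by chunk, which is what enables low-latency streaming. The toy sketch below illustrates only that pipeline shape; every function, token scheme, and chunk size here is a hypothetical stand-in, not the paper's actual model.

```python
# Toy sketch of the two-stage streaming pipeline described in the abstract:
# an autoregressive token LM followed by a chunk-aware acoustic stage.
# All names and sizes are illustrative, not from the CosyVoice 3 paper.

def token_lm(text):
    """Stand-in for the LLM stage: map each character to a pseudo speech token."""
    for ch in text:
        yield ord(ch) % 256  # hypothetical discrete token id

def chunk_aware_decoder(tokens, chunk_size=4):
    """Stand-in for chunk-aware flow matching: emit one 'audio chunk' per
    fixed-size token chunk, so synthesis can start before the full token
    sequence is available (the source of the low latency)."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == chunk_size:
            yield list(buf)
            buf.clear()
    if buf:  # flush the final partial chunk
        yield list(buf)

# "hello world" has 11 characters, so with chunk_size=4 the decoder
# streams out chunks of 4, 4, and 3 tokens.
chunks = list(chunk_aware_decoder(token_lm("hello world"), chunk_size=4))
```

The point of the chunked consumer is that each chunk can be vocoded and played back while the LLM is still generating later tokens, rather than waiting for the whole utterance.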
Problem

Research questions and friction points this paper is trying to address.

Enhances multilingual speech synthesis in diverse domains
Improves prosody naturalness and speaker similarity
Scales model and data for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel speech tokenizer improves prosody naturalness
Differentiable reward model enhances post-training
Scaled dataset and model size boost performance
Authors

- Zhihao Du — Alibaba
- Changfeng Gao — Speech Team, Tongyi Lab, Alibaba Group
- Yuxuan Wang — Speech Team, Tongyi Lab, Alibaba Group
- Fan Yu — Speech Team, Tongyi Lab, Alibaba Group
- Tianyu Zhao — Speech Team, Tongyi Lab, Alibaba Group
- Hao Wang — Speech Team, Tongyi Lab, Alibaba Group
- Xiang Lv — Speech Team, Tongyi Lab, Alibaba Group
- Hui Wang — Speech Team, Tongyi Lab, Alibaba Group
- Xian Shi — Qwen Team, Alibaba
- Keyu An — Speech Team, Tongyi Lab, Alibaba Group
- Guanrou Yang — Shanghai Jiao Tong University
- Yabin Li — Speech Team, Tongyi Lab, Alibaba Group
- Yanni Chen — Speech Team, Tongyi Lab, Alibaba Group
- Zhifu Gao — Speech Team, Tongyi Lab, Alibaba Group
- Qian Chen — Speech Team, Tongyi Lab, Alibaba Group
- Yue Gu — Speech Team, Tongyi Lab, Alibaba Group
- Mengzhe Chen — Speech Team, Tongyi Lab, Alibaba Group
- Yafeng Chen — University of Science and Technology of China
- Shiliang Zhang — Department of Computer Science, School of EECS, Peking University
- Wen Wang — Speech Team, Tongyi Lab, Alibaba Group
- Jieping Ye — Speech Team, Tongyi Lab, Alibaba Group