CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

📅 2025-05-23
🤖 AI Summary
CosyVoice 2 advances low-latency bi-streaming speech synthesis and audio quality, but suffers from narrow language coverage, limited domain diversity, small-scale training data, poor robustness to diverse text formats, and insufficient post-training techniques. To enable zero-shot multilingual speech synthesis in real-world scenarios, CosyVoice 3 introduces three key innovations: (1) a novel multi-task supervised speech tokenizer; (2) a transferable, differentiable prosody and alignment reward model; and (3) joint scaling of a million-hour multilingual and multidialectal dataset with a 1.5B-parameter model, integrating chunk-aware streaming flow matching, LLM-based modeling, and reinforcement-based post-training. Evaluated across 9 languages and 18 dialects, CosyVoice 3 achieves human-level audio quality and significantly improves cross-lingual zero-shot synthesis in prosodic naturalness, speaker fidelity, and text–speech alignment.

πŸ“ Abstract
In our prior work, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training, applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset size scaling: training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model size scaling: model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
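The abstract describes a two-stage pipeline: an LLM first produces discrete speech tokens, and a chunk-aware flow-matching model then converts them to audio chunk by chunk, which is what enables low-latency streaming. The toy sketch below illustrates only that pipeline shape; every function, token scheme, and chunk size here is a hypothetical stand-in, not the paper's actual model.

```python
# Toy sketch of the two-stage streaming pipeline described in the abstract:
# an autoregressive token LM followed by a chunk-aware acoustic stage.
# All names and sizes are illustrative, not from the CosyVoice 3 paper.

def token_lm(text):
    """Stand-in for the LLM stage: map each character to a pseudo speech token."""
    for ch in text:
        yield ord(ch) % 256  # hypothetical discrete token id

def chunk_aware_decoder(tokens, chunk_size=4):
    """Stand-in for chunk-aware flow matching: emit one 'audio chunk' per
    fixed-size token chunk, so synthesis can start before the full token
    sequence is available (the source of the low latency)."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == chunk_size:
            yield list(buf)
            buf.clear()
    if buf:  # flush the final partial chunk
        yield list(buf)

# "hello world" has 11 characters, so with chunk_size=4 the decoder
# streams out chunks of 4, 4, and 3 tokens.
chunks = list(chunk_aware_decoder(token_lm("hello world"), chunk_size=4))
```

The point of the chunked consumer is that each chunk can be vocoded and played back while the LLM is still generating later tokens, rather than waiting for the whole utterance.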
Problem

Research questions and friction points this paper is trying to address.

Enhances multilingual speech synthesis in diverse domains
Improves prosody naturalness and speaker similarity
Scales model and data for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel speech tokenizer improves prosody naturalness
Differentiable reward model enhances post-training
Scaled dataset and model size boost performance
Authors

- Zhihao Du — Alibaba
- Changfeng Gao — Speech Team, Tongyi Lab, Alibaba Group
- Yuxuan Wang — Speech Team, Tongyi Lab, Alibaba Group
- Fan Yu — Speech Team, Tongyi Lab, Alibaba Group
- Tianyu Zhao — Speech Team, Tongyi Lab, Alibaba Group
- Hao Wang — Speech Team, Tongyi Lab, Alibaba Group
- Xiang Lv — Speech Team, Tongyi Lab, Alibaba Group
- Hui Wang — Speech Team, Tongyi Lab, Alibaba Group
- Xian Shi — Qwen Team, Alibaba
- Keyu An — Speech Team, Tongyi Lab, Alibaba Group
- Guanrou Yang — Shanghai Jiao Tong University
- Yabin Li — Speech Team, Tongyi Lab, Alibaba Group
- Yanni Chen — Speech Team, Tongyi Lab, Alibaba Group
- Zhifu Gao — Speech Team, Tongyi Lab, Alibaba Group
- Qian Chen — Speech Team, Tongyi Lab, Alibaba Group
- Yue Gu — Speech Team, Tongyi Lab, Alibaba Group
- Mengzhe Chen — Speech Team, Tongyi Lab, Alibaba Group
- Yafeng Chen — University of Science and Technology of China
- Shiliang Zhang — Department of Computer Science, School of EECS, Peking University
- Wen Wang — Speech Team, Tongyi Lab, Alibaba Group
- Jieping Ye — Speech Team, Tongyi Lab, Alibaba Group