AI Summary
Large-scale zero-shot text-to-speech (TTS) models achieve high speech quality but suffer from excessive parameter counts and slow inference. This paper proposes ZipVoice, a lightweight zero-shot TTS model based on flow matching. The method introduces three key designs: (1) a Zipformer-based flow-matching decoder that retains strong modeling capacity at a small size; (2) an average-upsampling alignment mechanism, paired with a Zipformer-based text encoder, that provides an initial speech-text alignment without autoregressive or attention-based duration modeling; and (3) a flow distillation method that reduces sampling steps and removes the inference overhead of classifier-free guidance. Trained on 100k hours of multilingual speech data, ZipVoice matches state-of-the-art speech quality while being 3× smaller and up to 30× faster end-to-end than a DiT-based flow-matching baseline. The approach thus narrows the longstanding trade-off between synthesis fidelity and inference efficiency.
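The average-upsampling alignment can be illustrated with a minimal sketch: given N text-token embeddings and a target number of speech frames, each token is simply repeated for roughly an equal share of the frames. The function name and array shapes below are my own illustration, not the paper's implementation.

```python
import numpy as np

def average_upsample(token_embeddings: np.ndarray, num_frames: int) -> np.ndarray:
    """Spread N token embeddings evenly across num_frames speech frames.

    Frame i is assigned token floor(i * N / num_frames), so each token
    covers roughly num_frames / N consecutive frames -- a uniform initial
    speech-text alignment that needs no autoregressive or attention-based
    duration model.
    """
    num_tokens = token_embeddings.shape[0]
    idx = np.floor(np.arange(num_frames) * num_tokens / num_frames).astype(int)
    return token_embeddings[idx]

# Example: 3 one-hot "tokens" stretched to 6 frames -> each token gets 2 frames.
frames = average_upsample(np.eye(3), 6)
```

The flow-matching decoder then refines this coarse alignment into natural durations during generation.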
Abstract
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference due to their massive parameter counts. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based flow-matching decoder to maintain adequate modeling capability under a constrained size; 2) average-upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce the number of sampling steps and eliminate the inference overhead of classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Code, model checkpoints, and demo samples are publicly available.
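To see why eliminating classifier-free guidance (CFG) matters at inference time, consider a generic Euler sampler for a flow-matching model. This is an illustrative sketch, not the paper's code: with CFG, every integration step requires two network evaluations (conditional and unconditional), whereas a distilled student that bakes guidance into a single vector field needs only one, on top of needing fewer steps overall.

```python
import numpy as np

def euler_sample(vector_field, x0: np.ndarray, num_steps: int) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * vector_field(x, i * dt)
    return x

def cfg_field(cond_v, uncond_v, scale: float):
    """Classifier-free guidance: two model calls per evaluation."""
    def field(x, t):
        vc, vu = cond_v(x, t), uncond_v(x, t)  # conditional + unconditional
        return vu + scale * (vc - vu)
    return field
```

A CFG-distilled model is used directly as `vector_field`, halving the per-step cost; combined with step-count reduction, this is the source of the reported end-to-end speedup.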