🤖 AI Summary
In zero-shot text-to-speech (TTS), balancing speaker fidelity and text fidelity remains challenging. This paper evaluates how classifier-free guidance (CFG) strategies developed for image generation transfer to speech synthesis and extends separated-condition CFG to this domain. The authors find that CFG strategies effective in image generation generally fail to improve speech synthesis. They instead propose a phased strategy: standard CFG is applied during early timesteps to preserve text adherence, then selective CFG is activated in later timesteps to improve speaker similarity with limited degradation of text accuracy. Surprisingly, the effectiveness of selective CFG proves highly dependent on the text representation, with English and Mandarin yielding different results even under the same model—evidence of substantial cross-lingual differences in guidance sensitivity.
📝 Abstract
In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches to this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly text-representation dependent, as differences between English and Mandarin can lead to different results even with the same model.
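The phased strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model interface, the guidance weight `w`, the `switch_step` threshold, and the convention that sampling timesteps count down from high noise (`total_steps`) to 0 are all assumptions. Standard CFG drops both the text and speaker conditions in the unconditional branch; selective CFG drops only the speaker condition, so the guidance direction targets speaker similarity while text conditioning stays intact.

```python
def cfg_denoise(model, x, t, text_cond, spk_cond,
                w=3.0, switch_step=500):
    """Phased selective CFG for one denoising step (illustrative sketch).

    Assumes `model(x, t, text=..., spk=...)` returns a noise/score
    prediction and that passing None for a condition drops it.
    Timesteps are assumed to count down: large t = early (high noise).
    """
    # Fully conditioned prediction (both text and speaker conditions).
    cond_pred = model(x, t, text=text_cond, spk=spk_cond)

    if t > switch_step:
        # Early timesteps: standard CFG, unconditional branch drops
        # both conditions, which the paper finds preserves text adherence.
        null_pred = model(x, t, text=None, spk=None)
    else:
        # Later timesteps: selective CFG, drop only the speaker
        # condition so guidance pushes toward speaker similarity.
        null_pred = model(x, t, text=text_cond, spk=None)

    # Standard CFG extrapolation away from the (partially) unconditioned branch.
    return null_pred + w * (cond_pred - null_pred)
```

A toy stand-in for `model` makes the behavior easy to check: with a model whose output shifts by a fixed amount per active condition, the early branch amplifies both conditions while the late branch amplifies only the speaker term.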