🤖 AI Summary
This work proposes X-Voice, a 0.4B-parameter zero-shot cross-lingual voice cloning model for text-to-speech synthesis that does not require a transcript of the reference audio prompt. Built on the F5-TTS architecture, X-Voice uses the International Phonetic Alphabet (IPA) as a unified text representation, injects language identifiers at two levels, and decouples and schedules Classifier-Free Guidance. Training proceeds in two stages: a Stage 1 model trained with standard conditional flow matching synthesizes speaker-consistent audio prompts, and Stage 2 fine-tunes on these audio pairs with the prompt text masked. The resulting model clones an arbitrary reference speaker’s voice across 30 languages. In subjective and objective evaluations, X-Voice outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS and achieves cross-lingual cloning performance comparable to billion-parameter models such as Qwen3-TTS.
📝 Abstract
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing such as forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with the prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of the audio prompts. Architecturally, we extend F5-TTS with a dual-level injection of language identifiers and with decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
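As a rough illustration of the "standard conditional flow-matching training" mentioned for Stage 1, the sketch below builds one training pair under the linear noise-to-data path commonly used in flow-matching TTS. The abstract does not state the exact path, features, or loss used by X-Voice, so the function names `cfm_training_pair` and `mse`, the toy three-value "frame", and the choice of path are all assumptions made for illustration, not the authors' implementation.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Return (x_t, target_velocity) for one conditional flow-matching step.

    x1:  clean target features as a list of floats (a toy stand-in for a mel frame).
    t:   flow time in [0, 1]; t=0 is pure noise, t=1 is the data sample.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    # x0 is Gaussian noise with the same shape as the data sample.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Assumed linear path between noise and data: x_t = (1 - t) * x0 + t * x1.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # Along a straight path the velocity is constant: x1 - x0.
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]  # hypothetical clean features
x_t, v_target = cfm_training_pair(x1, t=0.25, rng=rng)

# A real model would predict the velocity from (x_t, t, conditioning);
# a zero prediction is used here only to show how the loss is formed.
loss = mse([0.0] * len(x1), v_target)
```

In a setup like the one the abstract describes, the network would additionally be conditioned on the IPA text, the language identifiers, and the audio prompt, with the prompt text masked during Stage 2; none of that conditioning is modeled in this toy example.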