X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

📅 2026-05-06

🤖 AI Summary
This work proposes X-Voice, a zero-shot cross-lingual voice cloning method for text-to-speech synthesis that does not require transcripts of the audio prompts. Built on the F5-TTS architecture, X-Voice uses the International Phonetic Alphabet (IPA) as a unified text representation, injects language identifiers at two levels, and decouples and schedules Classifier-Free Guidance. Training proceeds in two stages: a Stage 1 model synthesizes speaker-consistent audio prompts, and Stage 2 fine-tunes on these audio pairs with the prompt text masked. The resulting 0.4B model clones a reference speaker's voice across 30 languages. In subjective and objective evaluations, X-Voice outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS and achieves zero-shot cross-lingual cloning comparable to billion-scale models such as Qwen3-TTS.
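To make the idea of decoupled, scheduled Classifier-Free Guidance more concrete, here is a minimal sketch. It is not the paper's formulation: the split into a text term and a prompt term, the function names `guided_velocity` and `linear_schedule`, and the linear schedule over flow time are all assumptions chosen for illustration.

```python
import numpy as np

def guided_velocity(v_full, v_no_text, v_uncond, w_text, w_prompt):
    """Combine three model outputs with separate guidance weights.

    v_full    : prediction with text and audio prompt (hypothetical naming)
    v_no_text : prediction with the text condition dropped
    v_uncond  : prediction with every condition dropped
    Each condition gets its own weight instead of one shared CFG scale.
    """
    prompt_term = v_no_text - v_uncond   # effect of the audio prompt
    text_term = v_full - v_no_text       # effect of the text on top of it
    return v_uncond + w_prompt * prompt_term + w_text * text_term

def linear_schedule(t, w_start, w_end):
    """Assumed schedule: interpolate a guidance weight over flow time t in [0, 1]."""
    return w_start + (w_end - w_start) * t

# Toy usage with random arrays standing in for model outputs.
rng = np.random.default_rng(0)
v_full, v_no_text, v_uncond = (rng.standard_normal((4, 3)) for _ in range(3))
t = 0.25
v = guided_velocity(v_full, v_no_text, v_uncond,
                    w_text=linear_schedule(t, 3.0, 1.0),
                    w_prompt=linear_schedule(t, 1.0, 2.0))
```

With both weights set to 1 the combination reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, which is a quick sanity check on any decoupled scheme of this shape.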
📝 Abstract
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
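The Stage 1 objective is described as standard conditional flow-matching training. A minimal sketch of that objective, assuming the common straight-line (optimal-transport) path between noise and data, is shown below. The function names, the toy shapes, and the omission of text and prompt conditioning are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one flow-matching training example from a data sample x1.

    x1 stands in for a mel-spectrogram segment of shape (frames, dims).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # flow time drawn from [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path at time t
    target = x1 - x0                     # velocity of the straight path
    return xt, t, target

def cfm_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity."""
    xt, t, target = cfm_training_pair(x1, rng)
    pred = predict_velocity(xt, t)
    return float(np.mean((pred - target) ** 2))

# Toy usage: a placeholder predictor that always outputs zeros.
rng = np.random.default_rng(0)
x1 = np.ones((4, 3))
loss = cfm_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```

In a real system `predict_velocity` would be the neural network, additionally conditioned on the IPA text, language identifiers, and the audio prompt.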
Problem

Research questions and friction points this paper is trying to address.

zero-shot
cross-lingual
voice cloning
multilingual
speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot voice cloning
multilingual TTS
two-stage training
IPA-based representation
Classifier-Free Guidance
Rixi Xu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Qingyu Liu
Electronic and Computer Engineering, Peking University
wireless networking, mobile networking, internet of things, intelligent transportation
Haitao Li
College of Computer Science and Technology, Zhejiang University
Multimodal, Medical Image Analysis, ECG
Yushen Chen
Shanghai Jiao Tong University
Speech and Language Processing
Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis
Yunting Yang
Geely Automobile Research Institute (Ningbo) Company Ltd.
Jian Zhao
Geely Automobile Research Institute (Ningbo) Company Ltd.
Ke Li
Beijing Haitian Ruisheng Science Technology Ltd.
Berrak Sisman
Assistant Professor (ECE & DSAI), Johns Hopkins University
Machine Learning, Affective Computing, Speech Synthesis, Voice Conversion, Anti-spoofing
Qinyuan Cheng
Shanghai Innovation Institute; Fudan University
Xipeng Qiu
Shanghai Innovation Institute; Fudan University
Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Xie Chen
Shanghai Jiao Tong University <- Microsoft <- Cambridge University
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing