VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

📅 2025-11-15
🤖 AI Summary
This work introduces VoiceCraft-X, a unified framework for multilingual voice cloning and speech editing, addressing the limitation of prior approaches that require separate models for zero-shot TTS and speech editing. It proposes an autoregressive neural codec language model built on Qwen3 that combines phoneme-free cross-lingual text processing with a token reordering mechanism over time-aligned text and speech tokens, so that both tasks become a single sequence generation problem. Its key contribution is joint modeling of zero-shot TTS and speech editing across 11 languages, including ones with limited training data, within a single architecture. Experiments show robust performance across diverse linguistic settings, with high-quality, natural-sounding voice cloning and localized edits to existing recordings.

📝 Abstract
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
Problem

Research questions and friction points this paper is trying to address.

Prior approaches require separate models for zero-shot cross-lingual TTS and speech editing
Can speech generation and speech editing be handled as a single sequence generation problem?
Can one autoregressive model stay robust across 11 languages, even with limited per-language data?
Innovation

Methods, ideas, or system contributions that make the work stand out.

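A single autoregressive neural codec language model covers both multilingual speech editing and zero-shot TTS
Uses the Qwen3 large language model for phoneme-free cross-lingual text processing
Introduces a token reordering mechanism over time-aligned text and speech tokens, so both tasks reduce to one sequence generation problem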
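
The abstract does not spell out the reordering scheme, so the following is only a minimal sketch of the general idea, assuming a prefix/suffix/middle rearrangement: the span to be generated is moved to the end of the sequence so that a left-to-right model can condition on context from both sides. The names `MASK`, `SEP`, `reorder_for_infilling`, and `restore_order` are hypothetical, the tokens are plain strings standing in for codec tokens, and the time alignment with text tokens is left out entirely.

```python
from typing import List, Tuple

# Hypothetical special tokens; not taken from the paper.
MASK = "<mask>"  # marks where the span to generate used to sit
SEP = "<sep>"    # marks where generation of that span begins


def reorder_for_infilling(
    tokens: List[str], span: Tuple[int, int]
) -> Tuple[List[str], List[str]]:
    """Move tokens[start:end] to the end of the sequence.

    Returns (context, target). The context is prefix + MASK + suffix + SEP,
    and the target is the span an autoregressive model would generate.
    """
    start, end = span
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    return prefix + [MASK] + suffix + [SEP], middle


def restore_order(context: List[str], generated: List[str]) -> List[str]:
    """Put a generated span back at the MASK position."""
    mask_at = context.index(MASK)
    prefix = context[:mask_at]
    suffix = context[mask_at + 1 : -1]  # drop the trailing SEP
    return prefix + generated + suffix


# Editing: regenerate the two tokens in the middle of a recording.
edit_context, edit_target = reorder_for_infilling(
    ["s0", "s1", "s2", "s3", "s4"], (1, 3)
)
# edit_context is ["s0", "<mask>", "s3", "s4", "<sep>"]
# edit_target is ["s1", "s2"]

# Zero-shot TTS: the span to generate is empty and sits after the prompt,
# so the suffix is empty and generation is a plain continuation.
tts_context, _ = reorder_for_infilling(["p0", "p1"], (2, 2))
# tts_context is ["p0", "p1", "<mask>", "<sep>"]
```

Under this assumption, editing and TTS differ only in where the span sits: an edit has context on both sides, while TTS is the special case with an empty suffix. That is one way a single generation procedure could serve both tasks.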