🤖 AI Summary
This work introduces VoiceCraft-X, a single framework for multilingual zero-shot TTS (voice cloning) and speech editing, two tasks that prior approaches typically handle with separate models. Methodologically, it is an autoregressive neural codec language model built on Qwen3 that processes cross-lingual text without phonemes and uses a token reordering mechanism over time-aligned text and speech tokens to cast both tasks as one sequence generation problem. Its key contribution is the joint modeling of zero-shot TTS and speech editing across 11 languages within one architecture. The authors report high-quality, natural-sounding speech and robust performance across diverse linguistic settings, even with limited per-language data, covering both the generation of new audio and the editing of existing recordings.
📝 Abstract
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
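To make the "single sequence generation problem" idea concrete, the following is a minimal sketch of one common way to reorder tokens for infilling: the span to be edited is moved to the end of the sequence so that an autoregressive model can generate it left to right while conditioning on both sides. The abstract does not specify the actual ordering, so the function `reorder_for_infilling` and the `<MASK>` and `<SEP>` tokens are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

MASK = "<MASK>"  # hypothetical placeholder marking where the span was removed
SEP = "<SEP>"    # hypothetical separator placed before the span to generate


def reorder_for_infilling(
    tokens: Sequence[str], span: Tuple[int, int]
) -> Tuple[List[str], List[str]]:
    """Move tokens[start:end] to the end of the sequence.

    Returns (context, target): a left-to-right model would condition on
    `context` and generate `target`.
    """
    start, end = span
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    context = list(prefix) + [MASK] + list(suffix) + [SEP]
    return context, list(middle)


# Speech editing: regenerate the middle of an existing recording.
speech = ["s0", "s1", "s2", "s3", "s4"]
edit_context, edit_target = reorder_for_infilling(speech, (1, 3))

# Zero-shot TTS as a special case: the span to generate follows the prompt,
# so the suffix is empty and generation is a plain continuation.
prompt_plus_new = ["p0", "p1", "n0", "n1"]
tts_context, tts_target = reorder_for_infilling(prompt_plus_new, (2, 4))
```

Under this assumed layout, editing and TTS differ only in where the span sits: an editing span has context on both sides, while a TTS span has an empty suffix.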