VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

📅 2025-11-15

🤖 AI Summary

This work introduces a unified end-to-end framework for multilingual voice cloning and speech editing, addressing the limitation of prior approaches that require separate models for zero-shot cross-lingual TTS and speech editing. Methodologically, it proposes an autoregressive neural codec language model built on Qwen3, combining phoneme-free cross-lingual text processing with a reordering of time-aligned text and speech tokens so that both tasks become a single sequence generation problem. Its key contribution is the joint modeling of zero-shot TTS and speech editing across 11 languages, including ones with limited training data, within a single architecture. The model is reported to generate high-quality, natural-sounding speech and to perform robustly across diverse linguistic settings, supporting both voice cloning and localized edits to existing recordings.

📝 Abstract

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

Problem

Research questions and friction points this paper is trying to address.

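Prior approaches need separate models for zero-shot cross-lingual TTS and for speech editing; this work unifies both across 11 languages

Speech generation and speech editing are handled as a single sequence generation problem instead of two separate tasks

Per-language training data can be limited, so a unified autoregressive model must stay robust across diverse linguistic settings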

Innovation

Methods, ideas, or system contributions that make the work stand out.

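Unifies multilingual speech editing and zero-shot TTS in one autoregressive neural codec language model

Uses the Qwen3 large language model for phoneme-free cross-lingual text processing

Introduces a token reordering mechanism over time-aligned text and speech tokens, so that both tasks reduce to one sequence generation problem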
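
The idea of casting editing and zero-shot TTS as one sequence generation problem can be illustrated with a small sketch. This is not the paper's implementation: the `Segment` type, the `build_sequence` function, and the marker tokens `<text>`, `<speech>`, `<gap>`, and `<generate>` are all hypothetical names chosen for illustration, and the exact layout used by VoiceCraft-X may differ. The sketch assumes that text and speech tokens are already time-aligned in segments, and that the span to be produced is moved to the end of the sequence so an autoregressive model can condition on everything before it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One time-aligned unit: a piece of text and the codec tokens that speak it."""
    text: str
    speech: List[int]  # hypothetical codec token ids


def build_sequence(segments: List[Segment], start: int, end: int, new_text: str) -> List[str]:
    """Build a flat sequence in which segments[start:end] is replaced by new_text.

    The kept audio before and after the span is emitted first, with a <gap>
    marker at the position of the removed span. The new text goes last, followed
    by <generate>, where a model would start producing speech tokens.
    All marker names are illustrative assumptions, not the paper's vocabulary.
    """
    sequence: List[str] = []
    for seg in segments[:start]:          # context before the edited span
        sequence += ["<text>", seg.text, "<speech>", *map(str, seg.speech)]
    sequence.append("<gap>")              # where the generated audio belongs
    for seg in segments[end:]:            # context after the edited span
        sequence += ["<text>", seg.text, "<speech>", *map(str, seg.speech)]
    sequence += ["<text>", new_text, "<generate>"]
    return sequence


recording = [Segment("hello", [1, 2]), Segment("big", [3]), Segment("world", [4, 5])]

# Editing: replace the middle segment ("big") with new text.
edit_sequence = build_sequence(recording, 1, 2, "small")

# Zero-shot TTS: the whole recording is the voice prompt and nothing follows the gap.
tts_sequence = build_sequence(recording, len(recording), len(recording), "again")
```

Under these assumptions, zero-shot TTS is the special case of editing in which the replaced span is empty and sits at the end of the prompt, which is why a single model can serve both tasks.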
