OneVoice: One Model, Triple Scenarios - Towards Unified Zero-shot Voice Conversion

📅 2026-01-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes OneVoice, the first unified zero-shot voice conversion framework that overcomes the limitations of existing methods, which typically rely on specialized models tailored for language preservation, emotional expression, or singing. OneVoice leverages a VAE-free next-patch diffusion mechanism combined with a Mixture-of-Experts architecture to enable efficient and versatile modeling. It introduces a dual-path routing strategy that integrates shared experts with scene-aware domain experts, along with hierarchical gating to disentangle and fuse scene-specific prosodic features. A two-stage LoRA-enhanced training scheme further supports flexible control across diverse scenarios. Experimental results demonstrate that OneVoice matches or surpasses dedicated models across three distinct tasks while enabling ultra-fast inference with as few as two diffusion steps.
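The two-stage scheme mentioned above (a foundational pre-training stage followed by scenario enhancement with LoRA-based domain experts) follows the standard low-rank adaptation pattern: freeze a shared base weight, then train a small per-scenario low-rank update on top. A minimal NumPy sketch of that pattern is shown below; all names, shapes, and the zero-initialization of `B` are generic LoRA conventions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D, R = 8, 2  # hidden size and LoRA rank (illustrative values)

# Stage 1: a base weight learned during foundational pre-training; frozen afterwards.
W_base = rng.normal(size=(D, D)) / np.sqrt(D)

# Stage 2: one low-rank adapter (B @ A) per scenario, trained while W_base stays frozen.
# B starts at zero, so each adapter initially contributes nothing to the output.
adapters = {s: (np.zeros((D, R)), rng.normal(size=(R, D)))
            for s in ("speech", "expressive", "singing")}

def lora_forward(x, scenario, alpha=1.0):
    """Apply the frozen base weight plus the scenario's scaled low-rank update."""
    B, A = adapters[scenario]
    return x @ (W_base + alpha * (B @ A))

x = rng.normal(size=D)
print(lora_forward(x, "singing").shape)  # (8,)
```

Because only the small `B` and `A` matrices are trained per scenario, the scarce singing data never has to update the shared backbone, which is one way a scheme like this can mitigate the speech-vs-singing data imbalance the paper describes.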

📝 Abstract
Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification is a Mixture-of-Experts (MoE) that explicitly models shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism comprising shared expert isolation and scenario-aware domain expert assignment driven by global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gating mechanism, allowing adaptive use of prosody information. Furthermore, to realize this core idea and alleviate data imbalance (abundant speech vs. scarce singing), we adopt a two-stage progressive training procedure consisting of foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while demonstrating flexible control over scenarios and offering a fast decoding variant requiring as few as 2 steps. Code and model will be released soon.
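The abstract's unification recipe (always-active shared experts, a scenario-routed domain expert, and a gate controlling how much prosody conditioning enters each layer) can be sketched roughly as follows. This is a minimal NumPy illustration under assumed names and dimensions, not the authors' implementation; real MoE routing would use learned routers over token sequences rather than a hard scenario label:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative)

def expert(seed):
    """A toy 'expert': a single random linear projection."""
    W = np.random.default_rng(seed).normal(size=(D, D)) / np.sqrt(D)
    return lambda x: x @ W

shared_experts = [expert(s) for s in range(2)]            # always active (shared isolation)
domain_experts = {"speech": expert(10), "expressive": expert(11),
                  "singing": expert(12)}                   # scenario-routed

def dual_path_moe(x, scenario, prosody, gate_w):
    # Path 1: shared experts model conversion knowledge common to all scenarios.
    y = sum(e(x) for e in shared_experts) / len(shared_experts)
    # Path 2: the scenario cue routes the input to the matching domain expert.
    y = y + domain_experts[scenario](x)
    # Gated prosody fusion: a sigmoid gate scales how much prosody is injected.
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))
    return y + g * prosody

x, prosody, gate_w = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)
out = dual_path_moe(x, "singing", prosody, gate_w)
print(out.shape)  # (8,)
```

The point of the gate is that scenarios differ in how much prosodic conditioning they need (singing presumably far more than plain speech), so the model can learn to open or close the prosody path per layer and per input.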
Problem

Research questions and friction points this paper is trying to address.

voice conversion
zero-shot
unified model
scenario fragmentation
speaker cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified voice conversion
zero-shot
Mixture-of-Experts
diffusion-based modeling
prosody-aware gating
Zhichao Wang
JIUTIAN Research, China Mobile, Beijing, China
Tao Li
JIUTIAN Research, China Mobile, Beijing, China
Wenshuo Ge
JIUTIAN Research, China Mobile, Beijing, China
Zihao Cui
JIUTIAN Research, China Mobile, Beijing, China
Shilei Zhang
JIUTIAN Research, China Mobile, Beijing, China
Junlan Feng
Chief Scientist at China Mobile Research
Natural Language · Machine Learning · Speech Processing · Data Mining