Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot voice conversion (VC) methods struggle to simultaneously preserve linguistic content and accurately transfer target speaker style, and in particular offer limited prosodic controllability. To address this, the paper proposes Discl-VC, a controllable zero-shot VC framework with three key components: (1) explicit disentanglement of content and prosody information from speech; (2) an in-context-learning flow-matching Transformer that synthesizes the target speaker's timbre; and (3) a non-autoregressive masked generative Transformer that predicts discrete prosody tokens from prompts, enabling fine-grained, prompt-editable control over intonation, rhythm, and other prosodic attributes. The authors present this as the first work to integrate discrete prosody modeling with in-context learning for content–prosody disentanglement. Experiments show state-of-the-art performance on zero-shot VC benchmarks, with substantial gains in prosodic fidelity and speaker style transfer accuracy over prior approaches.
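To make the flow-matching component concrete: at inference, flow matching integrates an ODE dx/dt = v(x, t) from Gaussian noise at t = 0 to the target feature vector at t = 1. The sketch below assumes a linear (rectified-flow) probability path and replaces the learned Transformer with a closed-form `velocity_field`; all names are illustrative, not the paper's code.

```python
import numpy as np

def velocity_field(x, t, target):
    # Stand-in for the learned flow-matching Transformer: for a linear
    # path from noise to the target, the true conditional velocity is
    # (target - x) / (1 - t).
    return (target - x) / (1.0 - t)

def flow_matching_sample(target, steps=50, seed=0):
    """Euler-integrate dx/dt = v(x, t) from noise (t=0) to features (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # initial noise sample
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                         # t stays strictly below 1
        x = x + dt * velocity_field(x, t, target)
    return x

target = np.array([0.5, -1.0, 2.0, 0.0])   # toy "timbre" feature vector
sample = flow_matching_sample(target)
print(np.allclose(sample, target))          # exact for a linear path
```

In Discl-VC the velocity field is a Transformer conditioned on content tokens and an in-context speaker prompt; the toy closed-form field above only demonstrates the sampling loop.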

📝 Abstract
Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot voice conversion remains inaccurate for unseen speakers
Limited controllability over source and target speaker styles
Lack of precise prosody control in synthesized speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles content and prosody from speech
Uses in-context learning with flow matching transformer
Predicts prosody tokens non-autoregressively via prompts
Authors
Kaidi Wang (School of Informatics, Xiamen University, China)
Wenhao Guan (Xiamen University)
Ziyue Jiang (Zhejiang University)
Hukai Huang (School of Informatics, Xiamen University, China)
Peijie Chen (School of Informatics, Xiamen University, China)
Weijie Wu (Roblox)
Q. Hong (School of Informatics, Xiamen University, China)
Lin Li (School of Electronic Science and Engineering, Xiamen University, China)