Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

📅 2025-06-01
🤖 AI Summary
Zero-shot voice conversion (VC) faces challenges including timbre leakage through coupled prosody, difficulty in transferring target prosody, and low generation efficiency. To address these, this paper proposes the first rhythm-controllable zero-shot VC framework. Methodologically, it introduces a shortcut flow matching mechanism that conditions the network jointly on the current noise level and the desired step size, enabling high-fidelity speech synthesis in as few as two sampling steps. It also pioneers the use of a Mask Generative Transformer for in-context duration modeling, adapting content duration to the target speaking style for precise disentanglement and transfer of target speaker timbre and fine-grained prosody. Furthermore, by combining HuBERT-based content representations with a Diffusion Transformer, the framework, trained on a smaller dataset, achieves timbre similarity comparable to state-of-the-art methods while improving naturalness, intelligibility, and prosodic fidelity.
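The two-step generation described above can be sketched generically: a shortcut model takes the step size as an extra input, so one large Euler step can replace many small ones. The interface and `toy_model` below are illustrative assumptions, not R-VC's actual network.

```python
def shortcut_sample(model, x0, num_steps=2):
    """Euler sampling with a shortcut model conditioned on both the current
    noise level t and the step size d (a sketch of the shortcut-flow-matching
    idea; `model(x, t, d)` is a hypothetical interface)."""
    x = list(x0)
    d = 1.0 / num_steps
    for i in range(num_steps):
        t = i * d
        v = model(x, t, d)                         # velocity for a step of size d
        x = [xi + d * vi for xi, vi in zip(x, v)]  # one Euler step
    return x

# Toy stand-in model: flows linearly toward a fixed "target" frame.
target = [1.0, 1.0, 1.0, 1.0]
def toy_model(x, t, d):
    return [(ti - xi) / max(1.0 - t, d) for ti, xi in zip(target, x)]

x_generated = shortcut_sample(toy_model, [0.0] * 4, num_steps=2)
```

With `num_steps=2` the toy flow reaches the target exactly; the point is only that the sampler's cost is two network evaluations, independent of how fine-grained training was.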

📝 Abstract
Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may leak through prosody, and transferring target prosody to synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow matching during training, conditioning the network not only on the current noise level but also on the desired step size, enabling high timbre similarity and high-quality speech generation in fewer sampling steps, even in just two, thus minimizing latency. Experimental results show that R-VC achieves comparable speaker similarity to state-of-the-art VC methods with a smaller dataset, and surpasses them in terms of speech naturalness, intelligibility, and style transfer performance.
Problem

Research questions and friction points this paper is trying to address.

Achieving zero-shot voice conversion with rhythm control
Eliminating content-irrelevant information to prevent timbre leakage
Enhancing speech quality and efficiency with fewer sampling steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discretizes speech into HuBERT content tokens
Uses Mask Generative Transformer for duration modeling
Employs Diffusion Transformer with shortcut flow matching
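The duration modeling named above can be illustrated with a generic MaskGIT-style parallel decoder: start from fully masked duration tokens, and each round commit only the most confident predictions. The cosine schedule and the `toy_predict` stand-in are assumptions for illustration, not the paper's exact model.

```python
import math

def maskgit_decode(predict, length, steps=4):
    """Mask-generative decoding sketch: `predict` (a hypothetical stand-in for
    the duration transformer) returns a probability list per position; each
    round fills the most confident still-masked positions."""
    MASK = -1
    tokens = [MASK] * length
    for step in range(steps):
        probs = predict(tokens)                           # per-position distributions
        best = [max(range(len(p)), key=p.__getitem__) for p in probs]
        conf = [p[b] if tok == MASK else float("inf")     # keep committed positions
                for p, b, tok in zip(probs, best, tokens)]
        # Cosine schedule: how many positions remain masked after this round.
        keep_masked = int(length * math.cos((step + 1) / steps * math.pi / 2))
        order = sorted(range(length), key=lambda i: -conf[i])
        for i in order[: length - keep_masked]:
            tokens[i] = best[i]
    return tokens

# Toy predictor: always prefers duration token (position mod 5).
def toy_predict(tokens):
    return [[0.6 if v == i % 5 else 0.1 for v in range(5)]
            for i in range(len(tokens))]

durations = maskgit_decode(toy_predict, 8)
```

After the final round the schedule leaves zero positions masked, so the decoder always emits a complete duration sequence in a fixed number of parallel passes rather than one autoregressive step per token.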