ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

To address the high sampling step count and low inference efficiency of diffusion models in voice conversion, this paper proposes a zero-shot voice conversion method based on Rectified Flow (RF). It is the first work to introduce RF into voice conversion, modeling Mel-spectrogram generation as an ordinary differential equation (ODE) that follows the shortest path in latent space—thereby eliminating the iterative denoising process inherent in traditional DDPMs. We further design a dynamic speaker embedding optimization mechanism jointly conditioned on phoneme content and fundamental frequency (F0), enhancing timbre fidelity and few-shot generalization capability. Experiments demonstrate that our method achieves a MOS improvement of over 0.8 in both zero-shot and low-resource settings, while accelerating inference by 5–10× compared to DDPM-based approaches, significantly outperforming existing diffusion-based voice conversion models.

Technology Category

Application Category

📝 Abstract

In recent years, diffusion-based generative models have demonstrated remarkable performance in speech conversion, including Denoising Diffusion Probabilistic Models (DDPM) and others. However, the advantages of these models come at the cost of requiring a large number of sampling steps. This limitation hinders their practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well in small datasets and zero-shot scenarios.

Problem

Research questions and friction points this paper is trying to address.

Reduces sampling steps in diffusion-based voice conversion models

Optimizes speaker features using content and pitch information

Enhances zero-shot voice conversion performance in small datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses rectified flow for direct path conversion

Optimizes speaker features with content and pitch

Performs well in zero-shot and small datasets

🔎 Similar Papers

No similar papers found.