AI Summary
To address the high sampling step count and low inference efficiency of diffusion models in voice conversion, this paper proposes a zero-shot voice conversion method based on Rectified Flow (RF). It is the first work to introduce RF into voice conversion, modeling Mel-spectrogram generation as an ordinary differential equation (ODE) that follows the shortest path in latent space, thereby eliminating the iterative denoising process inherent in traditional DDPMs. We further design a dynamic speaker embedding optimization mechanism jointly conditioned on phoneme content and fundamental frequency (F0), enhancing timbre fidelity and few-shot generalization capability. Experiments demonstrate that our method achieves a MOS improvement of over 0.8 in both zero-shot and low-resource settings, while accelerating inference by 5 to 10× compared to DDPM-based approaches, significantly outperforming existing diffusion-based voice conversion models.
Abstract
In recent years, diffusion-based generative models have demonstrated remarkable performance in speech conversion, including Denoising Diffusion Probabilistic Models (DDPM) and others. However, the advantages of these models come at the cost of requiring a large number of sampling steps. This limitation hinders their practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well in small datasets and zero-shot scenarios.
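To make the "most direct path" idea concrete, here is a minimal NumPy sketch of the generic rectified-flow recipe. It is not the ReFlow-VC implementation. Training pairs a noise sample `x0` with a data sample `x1` on the straight line `x_t = (1 - t) * x0 + t * x1` and regresses a velocity field onto `x1 - x0`. Sampling integrates `dx/dt = v(x, t)` from `t = 0` to `t = 1` with fixed-step Euler. All names here (`interpolate`, `velocity_target`, `euler_sample`, `toy_velocity`, `target_mel`) and the 4 × 80 array shape are illustrative assumptions. The toy velocity is an analytic stand-in for a trained network and omits the speaker, content, and pitch conditioning described above.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1


def velocity_target(x0, x1):
    """Regression target for the velocity field; constant along a straight path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)

# Hypothetical "Mel-spectrogram" target: 4 frames x 80 Mel bins (assumed shape).
target_mel = rng.normal(size=(4, 80))


def toy_velocity(x, t):
    """Analytic stand-in for a trained network (assumption, not the paper's model).

    When the data distribution is a single point, the ideal rectified-flow
    velocity at (x, t) is (target - x) / (1 - t).
    """
    return (target_mel - x) / (1.0 - t)


noise = rng.normal(size=target_mel.shape)
one_step = euler_sample(toy_velocity, noise, num_steps=1)
ten_steps = euler_sample(toy_velocity, noise, num_steps=10)

print(np.allclose(one_step, target_mel), np.allclose(ten_steps, target_mel))
```

In this toy setting the path is exactly straight, so a single Euler step lands on the target and adding steps changes nothing. A learned velocity field is only approximately straight, so practical samplers still take a handful of steps, though far fewer than the long denoising chains typical of DDPM sampling.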