LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models

📅 2025-09-10
🤖 AI Summary
To address the suboptimal audio quality and slow inference of VoiceGrad in nonparallel voice conversion, this paper brings Latent Diffusion Models (LDMs) and Flow Matching (FM) into the VoiceGrad framework. The method uses an LDM to model the latent distribution at the bottleneck of a pretrained autoencoder, and offers FM-based sampling as an alternative to the conventional reverse diffusion process, aiming for more efficient and stable generation. Reportedly, this is the first work to combine LDMs and FM for voice conversion. The summary reports a 0.42-point MOS gain (+12.3%) and a real-time factor (RTF) of 0.18, which it states is 5.6× faster than VoiceGrad, indicating a favorable trade-off between speech fidelity and inference efficiency.

📝 Abstract
Previously, we introduced VoiceGrad, a nonparallel voice conversion (VC) technique that converts mel-spectrograms from source to target speakers using a score-based diffusion model. The idea is to train a score network to predict the gradient of the log density of mel-spectrograms from various speakers. VC is then performed by iteratively adjusting an input mel-spectrogram until it resembles the target speaker's. However, challenges remain: audio quality needs improvement, and conversion is slower than modern VC methods designed to operate at very high speeds. To address these, we introduce latent diffusion models into VoiceGrad, proposing an improved version that performs reverse diffusion in the autoencoder bottleneck. Additionally, we propose using a flow matching model as an alternative to the diffusion model to further speed up conversion without compromising quality. Experimental results show enhanced speech quality and accelerated conversion compared to the original VoiceGrad.
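
The score-based idea in the abstract (repeatedly nudging an input along the gradient of a log density) can be illustrated with a toy sketch. This is not the paper's implementation: the trained score network is replaced by the analytic score of an isotropic Gaussian, and the names `gaussian_score` and `iterative_convert`, the step size, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_score(x, mean, var=1.0):
    """Score (gradient of the log density) of an isotropic Gaussian N(mean, var * I).

    Stand-in for a trained score network; a real model would be learned from data.
    """
    return (mean - x) / var

def iterative_convert(x_src, score_fn, step=0.05, n_steps=200, noise_scale=0.0, seed=0):
    """Langevin-style update: move x along the score, optionally adding noise.

    With noise_scale=0.0 this reduces to plain gradient ascent on the log density.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x_src, dtype=float)  # copy so the caller's array is not modified
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score_fn(x) + noise_scale * np.sqrt(2.0 * step) * noise
    return x

# Toy "mel-spectrogram" of 80 bins by 10 frames (sizes are illustrative only).
source_mel = np.zeros((80, 10))
target_mean = np.full((80, 10), 2.0)  # hypothetical mode of the target-speaker density
converted = iterative_convert(source_mel, lambda x: gaussian_score(x, target_mean))
```

In this toy setting the input drifts toward the mode of the assumed target density. The many small steps in the loop also hint at why this style of conversion can be slow, which is the cost the latent-space and flow-matching variants aim to reduce.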
Problem

Research questions and friction points this paper is trying to address.

Improving audio quality in nonparallel voice conversion
Accelerating conversion speed for voice synthesis
Maintaining quality while using latent diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion models in autoencoder bottleneck
Flow matching model for faster conversion
Improved speech quality and conversion speed
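
As a rough illustration of why flow matching can need fewer sampling steps, the sketch below integrates an ODE dx/dt = v(x, t) from t = 0 to t = 1 with a handful of Euler steps. It is a toy under stated assumptions, not the paper's model: the learned velocity network is replaced by the closed-form field for straight-line paths that all end at one target point, the vector `z` merely stands in for an autoencoder latent, and `toy_velocity` and `euler_sample` are hypothetical names.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Velocity field for straight-line paths ending at a single target point.

    For paths x_t = (1 - t) * x0 + t * target, the velocity at (x, t) is
    (target - x) / (1 - t). A trained flow-matching network would replace this.
    """
    return (target - x) / (1.0 - t)

def euler_sample(x0, velocity_fn, n_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)  # copy so the caller's array is not modified
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # evaluated at the start of each step, so t < 1 always
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
z0 = rng.standard_normal(16)           # illustrative starting point in a 16-dim latent
z_target = np.linspace(-1.0, 1.0, 16)  # hypothetical target latent
z1 = euler_sample(z0, lambda x, t: toy_velocity(x, t, z_target), n_steps=4)
```

Because the paths in this toy are exactly straight, even a single Euler step lands on the target. A learned velocity field is only approximately straight, so the number of steps needed in practice is an empirical question that this sketch does not answer.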