FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based voice conversion models (e.g., VoiceGrad) suffer from slow inference due to iterative sampling; although FastVoiceGrad achieves one-step generation, its reliance on a computationally intensive content encoder still hinders end-to-end efficiency. This paper proposes FasterVoiceGrad, a one-step diffusion-based voice conversion model obtained via adversarial diffusion conversion distillation (ADCD), a framework that jointly distills the diffusion generator and the content encoder. Distillation is performed directly in the conversion process and combines adversarial training with score distillation, preserving speaker similarity and speech quality while shrinking the encoder. Evaluated on one-shot voice conversion, FasterVoiceGrad runs 6.6–6.9× faster on a GPU and 1.8× faster on a CPU than FastVoiceGrad, with comparable speech quality and speaker similarity.

📝 Abstract
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves competitive VC performance compared to FastVoiceGrad, with 6.6-6.9 and 1.8 times faster speed on a GPU and CPU, respectively.
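The core idea of ADCD, distilling a one-step student generator by combining a score-distillation loss against a frozen teacher with an adversarial loss, can be sketched in a toy form. The sketch below is a hypothetical 1-D simplification, not the paper's implementation: the "teacher" is a fixed map standing in for the multi-step diffusion model, the discriminator is a hand-written scoring function, and the student is a single scalar weight. The loss weight `lam` and all function names are illustrative assumptions.

```python
import random

random.seed(0)

# Hypothetical stand-ins: the frozen multi-step teacher collapsed to one
# map, and a discriminator that scores outputs near the teacher's as "real".
def teacher(x):
    return 0.8 * x  # teacher's converted output for source feature x

def disc(y, x):
    return -(y - teacher(x)) ** 2  # higher score = more realistic

w = 0.1    # one-step student generator, G(x) = w * x
lr = 0.05  # learning rate
lam = 0.1  # adversarial loss weight (illustrative)

for _ in range(200):
    x = random.uniform(-1, 1)
    # Score-distillation surrogate: squared error to the teacher's output.
    grad_score = 2 * (w * x - teacher(x)) * x
    # Adversarial term -D(G(x)), differentiated by central finite differences.
    eps = 1e-4
    l_adv = lambda v: -disc(v * x, x)
    grad_adv = (l_adv(w + eps) - l_adv(w - eps)) / (2 * eps)
    # Combined update: the student learns one-step conversion that both
    # matches the teacher and fools the discriminator.
    w -= lr * (grad_score + lam * grad_adv)
```

After training, `w` converges toward the teacher's map (0.8), illustrating how the two loss terms jointly pull the one-step student onto the teacher's conversion behavior; in the actual method both networks are deep models operating on acoustic features, and the content encoder is distilled alongside the generator.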
Problem

Research questions and friction points this paper is trying to address.

Accelerating inference in diffusion-based voice conversion
Reducing the computational cost of content encoding
Maintaining speech quality and speaker similarity with one-step conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial diffusion conversion distillation
One-step diffusion model
Simultaneous content encoder distillation