🤖 AI Summary
Diffusion-based voice conversion models (e.g., VoiceGrad) suffer from slow inference due to iterative sampling; although FastVoiceGrad achieves one-step generation, its reliance on a computationally intensive content encoder still hinders end-to-end efficiency. This paper proposes FasterVoiceGrad, a one-step diffusion-based voice conversion model obtained by jointly distilling the diffusion generator and the content encoder via Adversarial Diffusion Conversion Distillation (ADCD), a framework that performs distillation directly in the conversion process while combining adversarial and score distillation training. This eliminates explicit content modeling while preserving speaker identity and speech quality. Evaluated on one-shot voice conversion, FasterVoiceGrad achieves 6.6–6.9× GPU acceleration and 1.8× CPU acceleration over FastVoiceGrad, with comparable speech quality and speaker similarity, yielding an efficient, high-fidelity one-step voice conversion model that needs no separate content encoder at inference.
📝 Abstract
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves competitive VC performance compared to FastVoiceGrad, with 6.6–6.9× and 1.8× faster inference on a GPU and CPU, respectively.
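The combined objective the abstract describes (adversarial training plus score distillation, applied during the conversion process) can be sketched at a very high level as follows. This is a minimal, hypothetical illustration in plain Python: the toy scalar "networks", the function names, and the weighting `lam` are assumptions for exposition, not the paper's actual architecture or loss.

```python
import math

# Hypothetical sketch of an ADCD-style combined loss.
# Toy scalar functions stand in for the one-step student generator,
# the pretrained multi-step teacher, and the adversarial discriminator.

def student_generate(noisy_src, speaker_emb):
    """One-step student: maps a noised source feature plus a
    target-speaker embedding directly to a converted feature (toy)."""
    return 0.8 * noisy_src + 0.2 * speaker_emb

def teacher_score(x, t):
    """Pretrained teacher's score estimate at diffusion time t (toy)."""
    return -x * (1.0 - t)

def discriminator(x):
    """Adversarial critic: probability the sample is real speech (toy)."""
    return 1.0 / (1.0 + math.exp(-x))

def adcd_loss(noisy_src, speaker_emb, t=0.5, lam=1.0):
    """Combine an adversarial term with a score-distillation term."""
    fake = student_generate(noisy_src, speaker_emb)
    # Adversarial term: push the discriminator toward calling fake "real".
    l_adv = -discriminator(fake)
    # Score-distillation term: small when the teacher's score says the
    # student's output already lies on the data manifold.
    l_score = teacher_score(fake, t) ** 2
    return l_adv + lam * l_score

loss = adcd_loss(noisy_src=0.5, speaker_emb=1.0)
```

In the real model both distillation signals would backpropagate through the student generator (and, per the abstract, the distilled content encoder), so that a single forward pass learns to perform the full conversion.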