GenVC: Self-Supervised Zero-Shot Voice Conversion

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot voice conversion (VC) methods often rely on external supervised models to disentangle linguistic content from speaker characteristics, and they typically use parallel conversion, in which the output inherits the source utterance's temporal structure. Both choices limit speaker similarity and privacy preservation. GenVC is a generative zero-shot VC model that learns content-style disentanglement in a self-supervised manner, so it needs no external models and can be trained on large, unlabeled datasets. Its autoregressive generation lets the converted speech deviate from the source utterance's temporal structure, reducing how much source prosody and speaker identity carry over. In the reported experiments, GenVC achieves state-of-the-art speaker similarity with naturalness competitive with leading approaches, and it is highly effective for voice anonymization.

📝 Abstract
Zero-shot voice conversion has recently made substantial progress, but many models still depend on external supervised systems to disentangle speaker identity and linguistic content. Furthermore, current methods often use parallel conversion, where the converted speech inherits the source utterance's temporal structure, restricting speaker similarity and privacy. To overcome these limitations, we introduce GenVC, a generative zero-shot voice conversion model. GenVC learns to disentangle linguistic content and speaker style in a self-supervised manner, eliminating the need for external models and enabling efficient training on large, unlabeled datasets. Experimental results show that GenVC achieves state-of-the-art speaker similarity while maintaining naturalness competitive with leading approaches. Its autoregressive generation also allows the converted speech to deviate from the source utterance's temporal structure. This feature makes GenVC highly effective for voice anonymization, as it minimizes the preservation of source prosody and speaker characteristics, enhancing privacy protection.
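The contrast the abstract draws between parallel conversion and autoregressive generation can be illustrated with a toy sketch. This is not GenVC's architecture: `parallel_convert`, `autoregressive_convert`, `make_toy_decoder`, and the `frames_per_token` parameter are hypothetical names invented for illustration, and the "decoder" is a deterministic stand-in for a trained model. The only point shown is that a frame-wise mapping is locked to the source length, while a generator that decides when to stop is not.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def parallel_convert(source_frames, convert_frame):
    """Frame-wise mapping: exactly one output frame per source frame."""
    return [convert_frame(frame) for frame in source_frames]


def autoregressive_convert(content_tokens, next_token, max_len=200):
    """Generate one token at a time until the decoder emits EOS.

    The output length is chosen by the decoder, not by the source.
    max_len is a safety cap so the loop always terminates.
    """
    generated = []
    while len(generated) < max_len:
        token = next_token(content_tokens, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


def make_toy_decoder(frames_per_token):
    """Stand-in for a trained decoder (assumption, not a real model).

    It holds each content token for `frames_per_token` steps, mimicking
    a target speaker with a different speaking rate.
    """
    def next_token(content_tokens, generated):
        index = len(generated) // frames_per_token
        if index >= len(content_tokens):
            return EOS
        return f"acoustic({content_tokens[index]})"
    return next_token


source = ["h", "e", "l", "o"]
parallel_out = parallel_convert(source, lambda f: f"acoustic({f})")
slow_out = autoregressive_convert(source, make_toy_decoder(3))

print(len(source), len(parallel_out), len(slow_out))  # 4 4 12
```

In this toy setting the parallel output always has as many frames as the source, so the source timing survives conversion, whereas the autoregressive output length depends only on what the decoder chooses to emit.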
Problem

Research questions and friction points this paper is trying to address.

Zero-shot voice conversion
Self-supervised learning
Voice anonymization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised linguistic content disentanglement
Autoregressive generation for temporal deviation
Enhanced privacy via speaker anonymization