Generating Novel and Realistic Speakers for Voice Conversion

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing voice conversion (VC) methods rely on target speaker data, limiting their applicability to zero-shot conversion or entirely novel timbre generation. To address this, we propose SpeakerVAE, a lightweight deep hierarchical variational autoencoder that constructs a learnable, disentangled speaker timbre latent space amenable to direct sampling. SpeakerVAE enables plug-and-play timbre synthesis without requiring target speaker data or fine-tuning. It is compatible with mainstream VC frameworks such as FACodec and CosyVoice2, and supports high-fidelity synthetic speaker generation via latent-space sampling. Experiments demonstrate that generated voices achieve competitive perceptual quality, scoring 3.82 in MOS and 0.91 in speaker similarity (SIM), on par with real speakers and significantly surpassing existing zero-shot VC baselines. By enabling scalable, generative speaker modeling, SpeakerVAE establishes a new paradigm for open-domain voice conversion.

📝 Abstract
Voice conversion models modify timbre while preserving paralinguistic features, enabling applications such as dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users desire conversion to entirely novel, unseen voices. To address this, we introduce SpeakerVAE, a lightweight method for generating novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models, requiring no co-training or fine-tuning of the base VC system. We evaluated our approach with state-of-the-art VC models, FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.
Problem

Research questions and friction points this paper is trying to address.

Generating novel unseen speakers for voice conversion
Eliminating dependency on target speaker data
Creating flexible plug-in module for VC systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates novel speakers using hierarchical variational autoencoder
Lightweight plug-in module compatible with existing VC systems
Creates unseen speaker representations without target utterances
Meiying Melissa Chen
Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
Zhenyu Wang
Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
Zhiyao Duan
Professor of Electrical and Computer Engineering, University of Rochester
Computer Audition · Music Information Retrieval · Speech Processing · Audiovisual Learning