🤖 AI Summary
This paper addresses zero-shot voice conversion (VC) with fine-grained, multi-factor controllability over speaker identity, linguistic content, and prosody. Methodologically, it introduces a unified masked speech encoder-decoder Transformer architecture that incorporates multiple classifier-free guidance (CFG) paths, enabling joint conditional modeling of continuous or quantized linguistic features, pitch contours, and accent attributes. It further proposes a hybrid linguistic representation scheme that fuses quantized and continuous linguistic embeddings, optionally augmented with pitch-guided conditioning to improve prosodic control fidelity. Experimental results show that the proposed model significantly outperforms existing baselines in target speaker similarity and accent matching while achieving word/character error rates comparable to the best-performing baseline, thereby jointly attaining high naturalness and precise, interpretable controllability across the speaker, linguistic, and prosodic dimensions.
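To make the multi-path CFG idea concrete, the sketch below combines a separate guidance term for each conditioning stream (e.g., linguistic features, pitch contour), each weighted by its own scale. This is a minimal illustration of the general multi-condition CFG pattern under assumed names; `multi_path_cfg`, the condition keys, and the scales are all hypothetical and not MaskVCT's actual implementation.

```python
def multi_path_cfg(model, x, conds, null_conds, scales):
    """Combine per-condition classifier-free guidance terms.

    Each conditioning stream contributes its own guidance direction,
    weighted independently -- one common way to realize multi-factor
    control. All names here are illustrative, not the paper's API.
    """
    # Fully unconditional pass: every condition stream nulled out.
    uncond = model(x, dict(null_conds))
    out = uncond
    for name, cond in conds.items():
        # Pass with only this one condition active.
        active = dict(null_conds)
        active[name] = cond
        guided = model(x, active)
        # Add this stream's guidance direction with its own weight.
        out = out + scales[name] * (guided - uncond)
    return out


# Toy usage: a stand-in "model" whose output is a weighted sum of inputs.
if __name__ == "__main__":
    toy_model = lambda x, c: x + 0.5 * c["linguistic"] + 0.3 * c["pitch"]
    y = multi_path_cfg(
        toy_model,
        x=1.0,
        conds={"linguistic": 2.0, "pitch": 4.0},
        null_conds={"linguistic": 0.0, "pitch": 0.0},
        scales={"linguistic": 1.5, "pitch": 0.8},
    )
    print(y)  # guided output combining both condition streams
```

Because each stream gets its own scale, the relative influence of linguistic content versus prosody can be tuned at inference time without retraining, which is what makes the single-model, multi-condition design useful.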
📝 Abstract
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to improve intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
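The inference-time choices the abstract describes (linguistic representation type, inclusion of the pitch contour, per-factor guidance strength) can be pictured as a small configuration object. The sketch below is hypothetical and the field names are assumptions, not MaskVCT's actual interface.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ConversionControls:
    """Hypothetical knobs mirroring the controls the abstract describes."""

    # Choice of linguistic representation; the abstract ties this choice
    # to the trade-off between intelligibility and speaker similarity.
    linguistic_features: Literal["continuous", "quantized"] = "quantized"
    # Supplying the pitch contour constrains prosody; omitting it frees
    # the model to set prosody itself.
    use_pitch_contour: bool = True
    # Independent CFG scales, one per conditioning factor (assumed names).
    speaker_cfg_scale: float = 1.0
    linguistic_cfg_scale: float = 1.0
```

Exposing these as independent settings is what lets a user trade off speaker identity, linguistic content, and prosody per conversion rather than committing to one fixed conditioning scheme.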