AdaptVC: High Quality Voice Conversion with Adaptive Learning

📅 2025-01-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zero-shot voice conversion must preserve linguistic content while achieving high speaker similarity for target speakers unseen in training, and doing both at once remains challenging. This paper proposes a disentangled zero-shot voice conversion framework with two components. First, adapters dynamically modulate self-supervised speech representations (Wav2Vec 2.0) to explicitly decouple content and timbre embeddings. Second, a conditional flow-matching decoder, augmented with cross-attention over speaker embeddings, models speaker identity robustly and improves generalization to unseen speakers. Reported results include a mean opinion score (MOS) of 4.12, an 18.7% improvement in speaker similarity, and consistent gains over existing methods on objective metrics, including STOI and SIM. The framework targets high-fidelity zero-shot voice conversion.
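The idea of tuning self-supervised features with adapters can be illustrated with a learned layer-weighting scheme, where each adapter takes a softmax-weighted sum of the encoder's layer outputs. This is a minimal sketch under stated assumptions: the paper summary does not specify the adapter's form, and the names `weighted_layer_sum`, `content_logits`, and `speaker_logits`, as well as the toy feature values, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_sum(layer_feats, layer_logits):
    """Combine per-layer feature vectors for one frame.

    layer_feats: list over encoder layers, each a feature vector of equal length.
    layer_logits: one trainable scalar per layer (assumed adapter parameters).
    """
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]

# Toy stand-in for three encoder layers with 2-dimensional features.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two adapters share the same frozen features but hold different logits,
# so each can emphasize different layers (hypothetical values).
content_logits = [0.0, 0.0, 0.0]   # uniform weighting
speaker_logits = [5.0, 0.0, 0.0]   # strongly favors the first layer

content_feat = weighted_layer_sum(layers, content_logits)
speaker_feat = weighted_layer_sum(layers, speaker_logits)
```

In a trained system the logits would be learned jointly with the decoder, so the content and speaker branches could settle on different mixtures of layers without modifying the encoder itself.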

📝 Abstract

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
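The conditional flow matching objective mentioned in the abstract can be sketched in a few lines. The example below assumes the common straight-line (optimal-transport) path from noise to data; the abstract does not say which path the authors use. A fixed oracle velocity stands in for the neural decoder, which in the described system would be conditioned on content and speaker features. All function names and values here are hypothetical.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Return the point x_t on the straight path from noise x0 to data x1,
    and the target velocity u_t = x1 - x0 the model should regress to."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for one target mel frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample

x_t, u_t = cfm_training_pair(x0, x1, t=0.3)

# Oracle velocity field used in place of a trained, conditioned network.
def oracle(x, t):
    return u_t

loss_at_optimum = cfm_loss(oracle(x_t, 0.3), u_t)
x_gen = euler_sample(oracle, x0, steps=4)
```

Because the target velocity is constant along a straight path, a model that matches it well can generate with few integration steps, which is one commonly cited reason flow matching decoders are considered efficient.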

Problem

Research questions and friction points this paper is trying to address.

Voice Conversion

Style Retention

Unseen Scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaptVC Method

Voice Conversion

Style Preservation
