🤖 AI Summary
In zero-shot voice conversion, preserving linguistic content while achieving high speaker similarity for unseen target speakers remains challenging. This paper proposes a disentangled zero-shot voice conversion framework. First, an adapter dynamically modulates self-supervised speech representations (Wav2Vec 2.0) to explicitly decouple content and timbre embeddings. Second, a conditional flow-matching decoder, augmented with cross-attention over speaker embeddings, models speaker identity robustly and significantly improves generalization to unseen speakers. Experiments demonstrate state-of-the-art performance: a mean opinion score (MOS) of 4.12, an 18.7% improvement in speaker similarity, and consistent superiority over existing methods across objective metrics, including STOI and SIM. The proposed framework establishes a new paradigm for high-fidelity, zero-shot voice conversion.
📝 Abstract
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches use various methods to isolate the two, generalization still requires further attention, especially robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
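The abstract does not spell out the flow matching formulation, so the following is only a minimal sketch of the general technique, assuming the commonly used straight-line (optimal-transport) probability path. The names `cfm_training_pair`, `cfm_loss`, `euler_sample`, and the value of `SIGMA_MIN` are hypothetical, and `velocity_fn` stands in for the paper's decoder, whose cross-attention speaker conditioning is not modeled here.

```python
import random

SIGMA_MIN = 1e-4  # assumed terminal noise scale; not taken from the paper


def cfm_training_pair(x1, t, rng, sigma_min=SIGMA_MIN):
    """Build one training example for a target feature vector x1.

    Returns (x_t, u_t): a point on the straight path from a noise sample x0
    to x1 at time t in [0, 1], and the velocity a decoder would regress to.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted_velocity, u_t):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, u_t)) / len(u_t)


def euler_sample(velocity_fn, dim, steps, rng):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    In a conditioned decoder, velocity_fn would also receive the content and
    speaker features; here it only sees the current state and time.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the target velocity is constant along each straight path, sampling of this kind can often get by with a small number of integration steps, which is one plausible reading of the efficiency claim in the abstract.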