AI Summary
This study addresses the challenge of modeling timbre fidelity and controllability in arbitrary-to-arbitrary voice conversion (VC) using discrete-token vocoders. We propose vec2wav 2.0, which frames VC as a reference-prompted conditional waveform generation task: speech content is represented by self-supervised discrete tokens, timbre features are extracted via WavLM, and timbre-aware waveform reconstruction is enabled by an adaptive Snake activation function. Our method is the first to fully embed timbre control within the discrete-token vocoding process, requiring no timbre annotations for training. It also demonstrates, for the first time, that a purely discrete-token vocoder can independently disentangle and manipulate timbre, enabling cross-lingual VC. Experiments show state-of-the-art performance across arbitrary-to-arbitrary VC benchmarks, achieving the highest MOS (audio quality) and speaker similarity (SIM) scores. Notably, competitive cross-lingual conversion is attained using monolingual training data alone.
Abstract
We propose a new speech discrete-token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required for vec2wav 2.0 to be trained effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. Furthermore, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated by speech token vocoders alone, pushing the frontiers of VC and speech synthesis.
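To make the "adaptive Snake" idea concrete, here is a minimal NumPy sketch. The standard Snake activation is x + sin²(αx)/α; an adaptive variant conditions α on a timbre embedding so waveform reconstruction adapts to the reference speaker. The projection parameters (`proj_w`, `proj_b`) and the exact conditioning scheme are assumptions for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def snake(x, alpha):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    alpha sets the frequency of the periodic component; larger alpha
    lets the layer model higher-frequency waveform structure.
    """
    return x + np.sin(alpha * x) ** 2 / alpha

def adaptive_snake(x, timbre_emb, proj_w, proj_b):
    """Hypothetical adaptive variant: alpha is predicted per channel
    from a timbre embedding (e.g. pooled WavLM features), so the
    activation becomes timbre-aware.

    x:          (channels, time) feature map
    timbre_emb: (dim,) timbre vector
    proj_w:     (channels, dim) assumed learned weights
    proj_b:     (channels,)     assumed learned bias
    """
    # exp keeps alpha strictly positive, as Snake requires
    alpha = np.exp(proj_w @ timbre_emb + proj_b)   # (channels,)
    return snake(x, alpha[:, None])                # broadcast over time
```

Predicting α per channel from the reference embedding is one simple way to inject timbre into every nonlinearity of the generator rather than only at the input.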