AI Summary
This study addresses the challenge of modeling timbre fidelity and controllability in arbitrary-to-arbitrary voice conversion (VC) using discrete-token vocoders. We propose vec2wav 2.0, which frames VC as a reference-prompted conditional waveform generation task: speech content is represented by self-supervised discrete tokens, timbre features are extracted via WavLM, and timbre-aware waveform reconstruction is enabled by an adaptive Snake activation function. Our method is the first to fully embed timbre control within the discrete-token vocoding process, requiring no timbre annotations for training. It also demonstrates, for the first time, that a purely discrete-token vocoder can independently disentangle and manipulate timbre, enabling cross-lingual VC. Experiments show state-of-the-art performance across arbitrary-to-arbitrary VC benchmarks, achieving the highest MOS (audio quality) and speaker similarity (SIM) scores. Notably, competitive cross-lingual conversion is attained using monolingual training data alone.
Abstract
We propose a new speech discrete-token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required for vec2wav 2.0 to be trained effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. Furthermore, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated by speech token vocoders alone, pushing the frontiers of VC and speech synthesis.
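To make the "adaptive Snake" idea concrete, here is a minimal NumPy sketch. The standard Snake activation is x + sin²(αx)/α; an adaptive variant conditions α on a timbre embedding so waveform reconstruction adapts to the reference speaker. The projection parameters (`proj_w`, `proj_b`) and the exact conditioning scheme are assumptions for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def snake(x, alpha):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    alpha sets the frequency of the periodic component; larger alpha
    lets the layer model higher-frequency waveform structure.
    """
    return x + np.sin(alpha * x) ** 2 / alpha

def adaptive_snake(x, timbre_emb, proj_w, proj_b):
    """Hypothetical adaptive variant: alpha is predicted per channel
    from a timbre embedding (e.g. pooled WavLM features), so the
    activation becomes timbre-aware.

    x:          (channels, time) feature map
    timbre_emb: (dim,) timbre vector
    proj_w:     (channels, dim) assumed learned weights
    proj_b:     (channels,)     assumed learned bias
    """
    # exp keeps alpha strictly positive, as Snake requires
    alpha = np.exp(proj_w @ timbre_emb + proj_b)   # (channels,)
    return snake(x, alpha[:, None])                # broadcast over time
```

Predicting α per channel from the reference embedding is one simple way to inject timbre into every nonlinearity of the generator rather than only at the input.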