vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

📅 2024-09-03
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of modeling timbre fidelity and controllability in arbitrary-to-arbitrary voice conversion (VC) using discrete-token vocoders. We propose vec2wav 2.0, which frames VC as a reference-prompted conditional waveform generation task: speech content is represented by self-supervised discrete tokens, timbre features are extracted via WavLM, and timbre-aware waveform reconstruction is enabled by an adaptive Snake activation function. Our method is the first to fully embed timbre control within the discrete-token vocoding process, requiring no timbre annotations for training. It also demonstrates, for the first time, that a purely discrete-token vocoder can independently disentangle and manipulate timbre, enabling cross-lingual VC. Experiments show state-of-the-art performance across arbitrary-to-arbitrary VC benchmarks, achieving the highest MOS (audio quality) and speaker similarity (SIM) scores. Notably, competitive cross-lingual conversion is attained using monolingual training data alone.
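The reference-prompted vocoding framing in the summary above can be sketched as follows. This is a minimal illustrative outline, not the paper's actual API: `convert`, `vocoder`, and `timbre_encoder` are placeholder names (the paper extracts timbre features with WavLM).

```python
# Minimal sketch of VC framed as reference-prompted vocoding; all function
# names are placeholders standing in for the paper's components.

def convert(source_tokens, reference_wav, vocoder, timbre_encoder):
    # Timbre prompt: continuous features extracted from the reference
    # speaker's waveform (the paper uses WavLM features here).
    timbre_feats = timbre_encoder(reference_wav)
    # The vocoder reconstructs a waveform from the source speaker's
    # discrete content tokens, conditioned on the reference timbre
    # features, so the output carries the reference speaker's voice.
    return vocoder(source_tokens, timbre_feats)
```

Because the content tokens largely discard timbre, conditioning the vocoder on reference features is what lets the same token sequence be rendered in different voices.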

๐Ÿ“ Abstract
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required for vec2wav 2.0 to be trained effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. Furthermore, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can be manipulated by speech token vocoders alone, pushing the frontiers of VC and speech synthesis.
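The adaptive Snake activation mentioned in the abstract could be sketched as below. The base Snake activation, x + (1/α)·sin²(αx), is established; the adaptive part shown here, predicting α per channel from a timbre embedding via a linear projection, is an illustrative assumption, not the paper's exact parameterization.

```python
import numpy as np

def snake(x, alpha):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x).
    # The periodic term gives the network a bias toward oscillatory
    # structure, which suits waveform generation.
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2

def adaptive_snake(x, timbre_emb, w, b):
    # Hypothetical adaptive variant: the frequency parameter alpha is
    # predicted from a timbre embedding via a linear projection.
    # (timbre_emb, w, b, and the projection are illustrative assumptions.)
    # exp() keeps alpha strictly positive.
    alpha = np.exp(timbre_emb @ w + b)
    return snake(x, alpha)
```

Conditioning α on the reference timbre lets the same vocoder body reshape its nonlinearity per speaker, which is one plausible reading of "timbre-aware waveform reconstruction."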
Problem

Research questions and friction points this paper is trying to address.

Improving voice conversion via discrete token vocoders
Addressing speaker timbre loss in content tokens
Enhancing cross-lingual VC with monolingual training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discrete tokens for voice conversion
Incorporates WavLM for timbre information
Employs adaptive Snake activation function
Yiwei Guo
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Anomaly Detection, AIOps
Junjie Li
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Chenpeng Du
ByteDance
Speech Interaction
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Shuai Wang
Shenzhen Research Institute of Big Data, Shenzhen, China
Xie Chen
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China