O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional voice conversion (VC) approaches struggle to effectively disentangle speaker identity from linguistic content, often resulting in phonetic information loss. To address this, we propose a one-to-one aligned training paradigm leveraging synthetic data: high-fidelity, pre-trained multi-speaker text-to-speech (TTS) models generate paired utterances—identical in linguistic content but spoken by different speakers—to train an end-to-end VC model. This method avoids explicit feature disentanglement; instead, it implicitly preserves linguistic information and models speaker characteristics in a purely data-driven manner. Consequently, it natively supports zero-shot adaptation to unseen speakers and cross-lingual VC. Evaluated on multiple benchmarks, our approach achieves a 16.35% relative reduction in word error rate (WER) and a 5.91% absolute improvement in speaker similarity score, significantly outperforming state-of-the-art methods.
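The paired-data construction at the heart of this paradigm is easy to picture in code. The sketch below is an illustration under stated assumptions, not the authors' pipeline: `tts_synthesize` stands in for whatever pretrained multi-speaker TTS model is used, and returns dummy audio so the example runs end to end.

```python
# Minimal sketch of one-to-one aligned pair construction, assuming a
# generic multi-speaker TTS; `tts_synthesize` is a placeholder, not the
# paper's actual system.
import random
import numpy as np

def tts_synthesize(text: str, speaker_id: int, sr: int = 16000) -> np.ndarray:
    """Stand-in for a pretrained multi-speaker TTS call; returns fake audio."""
    seed = hash((text, speaker_id)) % 2**32
    return np.random.default_rng(seed).standard_normal(sr)

def make_vc_pairs(texts: list[str], speaker_ids: list[int], n_pairs: int):
    """Build (source, target) utterances that share linguistic content but
    differ in speaker identity: the supervision signal that lets the VC
    model learn a direct mapping without explicit disentanglement."""
    pairs = []
    for _ in range(n_pairs):
        text = random.choice(texts)
        src_spk, tgt_spk = random.sample(speaker_ids, 2)  # two distinct speakers
        pairs.append((tts_synthesize(text, src_spk),      # model input
                      tts_synthesize(text, tgt_spk),      # training target
                      tgt_spk))
    return pairs

pairs = make_vc_pairs(["hello world", "voice conversion"], [0, 1, 2, 3], n_pairs=4)
print(len(pairs), pairs[0][0].shape)  # 4 (16000,)
```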

📝 Abstract
Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic speech data generated by a high-quality, pretrained multispeaker text-to-speech (TTS) model. Specifically, synthetic data pairs that share the same linguistic content but differ in speaker identity are used as input-output pairs to train the voice conversion model. This enables the model to learn a direct mapping between source and target voices, effectively capturing speaker-specific characteristics while preserving linguistic content. Additionally, we introduce a flexible training strategy for any-to-any voice conversion that generalizes well to unseen speakers and new languages, enhancing adaptability and performance in zero-shot scenarios. Our experiments show that our proposed method achieves a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity, outperforming several state-of-the-art methods. Voice conversion samples can be accessed at: https://oovc-emnlp-2025.github.io/
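For context on the reported metric: speaker cosine similarity is conventionally computed between speaker-verification embeddings of the converted audio and a reference from the target speaker. A minimal sketch follows, assuming a placeholder embedding extractor; the paper's actual evaluation model is not specified here.

```python
# Speaker cosine similarity, sketched with a dummy embedding extractor.
# Real evaluations use a pretrained speaker-verification encoder.
import numpy as np

def speaker_embedding(wav: np.ndarray, dim: int = 192) -> np.ndarray:
    """Placeholder for a pretrained speaker-verification encoder."""
    seed = int(np.abs(wav).sum() * 1e3) % 2**32
    return np.random.default_rng(seed).standard_normal(dim)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

converted = np.random.default_rng(0).standard_normal(16000)
reference = np.random.default_rng(1).standard_normal(16000)
sim = cosine_similarity(speaker_embedding(converted), speaker_embedding(reference))
print(f"speaker cosine similarity: {sim:.3f}")
```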
Problem

Research questions and friction points this paper is trying to address.

Addresses voice conversion disentanglement challenges using synthetic data
Enables direct mapping between speakers while preserving linguistic content
Enhances zero-shot adaptability for unseen speakers and languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic speech pairs from TTS model
Learns direct mapping between source and target voices
Enables flexible any-to-any voice conversion training (see the training sketch below)
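Because every source utterance comes with a same-content target rendition, training reduces to direct supervised regression. The following toy sketch illustrates that setup; `ToyVC`, the mel-like feature shapes, and the L1 loss are all stand-ins, not the paper's architecture.

```python
# Toy training step under one-to-one alignment. Everything here is an
# illustrative stand-in (mel-like features, a single linear layer, L1
# loss); the paper's actual architecture and losses are not shown.
import torch
import torch.nn as nn

class ToyVC(nn.Module):
    """Maps source features plus a target-speaker reference clip to
    converted features."""
    def __init__(self, dim: int = 80):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, source: torch.Tensor, target_ref: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([source, target_ref], dim=-1))

model = ToyVC()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

source = torch.randn(8, 100, 80)      # source utterance, 100 mel-like frames
target = torch.randn(8, 100, 80)      # same text, target speaker (TTS-generated)
target_ref = torch.randn(8, 100, 80)  # any clip of the target speaker

opt.zero_grad()
loss = loss_fn(model(source, target_ref), target)  # direct supervision, no disentanglement
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```

At inference, zero-shot conversion to an unseen speaker would just mean swapping in that speaker's reference clip; nothing in this objective ties the model to training-set identities.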
👥 Authors
Huu Tuong Tu
VNPT AI, VNPT Group
Huan Vu
Business AI Lab, National Economics University
Cuong Tien Nguyen
VNPT AI, VNPT Group
Dien Hy Ngo
VNPT AI, VNPT Group
Nguyen Thi Thu Trang
Lecturer & Researcher, School of Information and Communication Technology, Hanoi University of Science and Technology
Speech Synthesis · Speaker Recognition · Speech Technology · Natural Language Processing