SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion

📅 2025-10-10
🤖 AI Summary
Existing streaming voice conversion (VC) systems suffer from high latency, reliance on automatic speech recognition (ASR) or speaker separation modules, and degradation in voice similarity and naturalness due to timbre leakage. To address these issues, this paper proposes an end-to-end low-latency streaming VC framework that eliminates explicit content–timbre disentanglement and external recognition modules. Instead, it leverages a pre-trained zero-shot VC model to synthesize high-fidelity parallel data for direct timbre mapping. Built upon a neural audio codec architecture, the framework enables fully end-to-end streaming training and inference. Experiments demonstrate an end-to-end latency of only 77.1 ms, while achieving statistically significant improvements in both naturalness (MOS) and speaker similarity (SIM) over state-of-the-art streaming VC methods. This work establishes a new paradigm for real-time VC: efficient, robust, and free of auxiliary ASR components.

📝 Abstract
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time streaming voice conversion with low latency
Eliminating dependency on ASR modules and complex speaker disentanglement
Preventing timbre leakage while maintaining output naturalness
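To make the latency goal concrete: in chunk-based streaming VC, end-to-end latency is roughly the sum of the audio buffered per step, any lookahead context, and per-chunk compute time. The sketch below illustrates this budget; the component values are invented for illustration (the paper reports only the 77.1 ms total, not a breakdown).

```python
# Illustrative streaming latency budget. The three component values below
# are hypothetical; only the total matches the figure reported in the paper.
chunk_ms = 40.0      # audio gathered per streaming step (assumed)
lookahead_ms = 20.0  # future context the model waits for (assumed)
compute_ms = 17.1    # average per-chunk inference time (assumed)

end_to_end_ms = chunk_ms + lookahead_ms + compute_ms
print(f"{end_to_end_ms:.1f} ms")
```

Shrinking any one term (smaller chunks, causal models with no lookahead, faster inference) lowers the total, which is why fully causal codec-based designs are attractive for real-time use.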
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic parallel data for training
Eliminates content-speaker separation modules
Enables low-latency streaming with neural codec
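The core training idea in the bullets above can be sketched as a data pipeline: a pre-trained zero-shot VC model converts each source utterance into the target timbre, and the resulting (source, converted) pairs supervise a streaming student model directly. The function names and the identity stand-in below are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def zero_shot_convert(source: np.ndarray, target_ref: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained zero-shot VC model.
    A real model would return `source`'s content rendered in the
    timbre of `target_ref`; here we return the input unchanged,
    purely so the pipeline shape is runnable."""
    return source

def build_parallel_corpus(sources, target_ref):
    """Pair each source utterance with its converted counterpart.
    A streaming model trained on these pairs learns the timbre
    mapping end-to-end, with no ASR module or explicit
    content-speaker disentanglement."""
    return [(src, zero_shot_convert(src, target_ref)) for src in sources]

# Toy data: three 1-second utterances at 16 kHz.
sr = 16000
sources = [np.random.randn(sr).astype(np.float32) for _ in range(3)]
target_ref = np.random.randn(sr).astype(np.float32)

pairs = build_parallel_corpus(sources, target_ref)
print(len(pairs))
```

Because supervision comes from synthetic parallel audio rather than disentangled features, timbre leakage from imperfect content-speaker separation is avoided by construction.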
Zhao Guo
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Ziqian Ning
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Guobin Ma
Northwestern Polytechnical University
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China