🤖 AI Summary
Existing streaming voice conversion (VC) systems suffer from high latency, reliance on automatic speech recognition (ASR) or speaker separation modules, and degraded voice similarity and naturalness caused by timbre leakage. To address these issues, this paper proposes an end-to-end low-latency streaming VC framework that eliminates explicit content–timbre disentanglement and external recognition modules. Instead, it leverages a pre-trained zero-shot VC model to synthesize high-fidelity parallel data for direct timbre mapping. Built on a neural audio codec architecture, the framework supports fully end-to-end streaming training and inference. Experiments demonstrate an end-to-end latency of only 77.1 ms, with statistically significant improvements in both naturalness (MOS) and speaker similarity (SIM) over state-of-the-art streaming VC methods. This work establishes a new paradigm for real-time VC that is efficient, robust, and free of auxiliary ASR components.
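The core idea above, using a pre-trained zero-shot VC model as a teacher to manufacture (source, converted) waveform pairs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `zero_shot_vc` is a stand-in stub for the real pre-trained model, and all shapes and sample rates are assumptions.

```python
import numpy as np

def zero_shot_vc(source_wav: np.ndarray, target_ref: np.ndarray) -> np.ndarray:
    """Stand-in for a pre-trained zero-shot VC model (hypothetical stub).
    A real model would re-synthesize `source_wav` in the timbre of
    `target_ref`; here we just return a same-length placeholder signal."""
    return 0.5 * source_wav + 0.01 * np.resize(target_ref, source_wav.shape)

def build_parallel_dataset(sources, target_ref):
    """Pair each source utterance with its teacher-converted version.
    The streaming student model can then be trained by direct regression
    from input to converted waveform, with no ASR or disentanglement."""
    return [(src, zero_shot_vc(src, target_ref)) for src in sources]

rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(3)]  # three 1 s clips @ 16 kHz
target_ref = rng.standard_normal(48000)                   # 3 s target-speaker reference
pairs = build_parallel_dataset(sources, target_ref)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Because the teacher produces time-aligned parallel audio, the student never needs explicit content–speaker separation: timbre mapping is learned directly from input/output pairs.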
📝 Abstract
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
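For intuition on where a figure like 77.1 ms comes from, end-to-end latency in chunk-based streaming inference is roughly the chunk buffering time plus any lookahead plus per-chunk compute. The sketch below is purely illustrative; the chunk size, lookahead, sample rate, and compute time are assumed values, not SynthVC's actual configuration.

```python
def streaming_latency_ms(chunk_samples: int, lookahead_samples: int,
                         sample_rate: int, compute_ms: float) -> float:
    """Algorithmic latency (chunk buffering + lookahead) plus per-chunk compute."""
    algorithmic_ms = 1000.0 * (chunk_samples + lookahead_samples) / sample_rate
    return algorithmic_ms + compute_ms

# Example: 20 ms chunks with 40 ms lookahead at 24 kHz, plus 15 ms of compute.
latency = streaming_latency_ms(480, 960, 24000, 15.0)
print(f"{latency:.1f} ms")  # → 75.0 ms (60 ms algorithmic + 15 ms compute)
```

This is why causal, codec-based architectures matter for streaming VC: shrinking the chunk and lookahead directly lowers the algorithmic floor, while an efficient decoder keeps the compute term small.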