🤖 AI Summary
Existing streaming voice conversion (VC) systems suffer from high latency, reliance on automatic speech recognition (ASR) or speaker separation modules, and degraded voice similarity and naturalness caused by timbre leakage. To address these issues, this paper proposes an end-to-end low-latency streaming VC framework that eliminates explicit content–timbre disentanglement and external recognition modules. Instead, it leverages a pre-trained zero-shot VC model to synthesize high-fidelity parallel data for direct timbre mapping. Built on a neural audio codec architecture, the framework supports fully end-to-end streaming training and inference. Experiments demonstrate an end-to-end latency of only 77.1 ms, with statistically significant improvements in both naturalness (MOS) and speaker similarity (SIM) over state-of-the-art streaming VC methods. This work establishes a new paradigm for real-time VC that is efficient, robust, and free of auxiliary ASR components.
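The core idea above, using a pre-trained zero-shot VC model as a teacher to manufacture (source, converted) waveform pairs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `zero_shot_vc` is a stand-in stub for the real pre-trained model, and all shapes and sample rates are assumptions.

```python
import numpy as np

def zero_shot_vc(source_wav: np.ndarray, target_ref: np.ndarray) -> np.ndarray:
    """Stand-in for a pre-trained zero-shot VC model (hypothetical stub).
    A real model would re-synthesize `source_wav` in the timbre of
    `target_ref`; here we just return a same-length placeholder signal."""
    return 0.5 * source_wav + 0.01 * np.resize(target_ref, source_wav.shape)

def build_parallel_dataset(sources, target_ref):
    """Pair each source utterance with its teacher-converted version.
    The streaming student model can then be trained by direct regression
    from input to converted waveform, with no ASR or disentanglement."""
    return [(src, zero_shot_vc(src, target_ref)) for src in sources]

rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(3)]  # three 1 s clips @ 16 kHz
target_ref = rng.standard_normal(48000)                   # 3 s target-speaker reference
pairs = build_parallel_dataset(sources, target_ref)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Because the teacher produces time-aligned parallel audio, the student never needs explicit content–speaker separation: timbre mapping is learned directly from input/output pairs.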
📝 Abstract
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
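For intuition on where a figure like 77.1 ms comes from, end-to-end latency in chunk-based streaming inference is roughly the chunk buffering time plus any lookahead plus per-chunk compute. The sketch below is purely illustrative; the chunk size, lookahead, sample rate, and compute time are assumed values, not SynthVC's actual configuration.

```python
def streaming_latency_ms(chunk_samples: int, lookahead_samples: int,
                         sample_rate: int, compute_ms: float) -> float:
    """Algorithmic latency (chunk buffering + lookahead) plus per-chunk compute."""
    algorithmic_ms = 1000.0 * (chunk_samples + lookahead_samples) / sample_rate
    return algorithmic_ms + compute_ms

# Example: 20 ms chunks with 40 ms lookahead at 24 kHz, plus 15 ms of compute.
latency = streaming_latency_ms(480, 960, 24000, 15.0)
print(f"{latency:.1f} ms")  # → 75.0 ms (60 ms algorithmic + 15 ms compute)
```

This is why causal, codec-based architectures matter for streaming VC: shrinking the chunk and lookahead directly lowers the algorithmic floor, while an efficient decoder keeps the compute term small.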