🤖 AI Summary
Existing singing voice conversion (SVC) methods rely on clean vocals, but vocals separated from accompanied recordings still carry residual harmonies that interfere with melody extraction. This work proposes a zero-shot, cross-lingual SVC system that explicitly models both the lead melody and the residual harmonies, so that it can process polyphonic audio. The architecture combines a CQT-based pitch extractor, a random sampler that reduces interference from the CQT, and a diffusion decoder based on conditional flow matching that fuses pitch, content, and timbre features. Experiments show that the approach outperforms the baseline models on both harmony-rich and single-melody recordings in naturalness, timbre similarity, and harmony reconstruction.
📝 Abstract
Singing Voice Conversion (SVC) aims to transform a source singing voice into the voice of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor that preserves both the lead melody and the residual harmony, a random sampler that reduces interference information from the CQT, and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
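To illustrate why a CQT representation can retain a residual harmony that a single-valued F0 track would discard, the following is a minimal sketch of one naive constant-Q frame in NumPy. It is not the Poly-SVC pitch extractor: the function name `naive_cqt_frame`, the sample rate, the frequency range, the Hann window, and the two-tone test signal are all assumptions chosen for illustration.

```python
import numpy as np


def naive_cqt_frame(x, sr, f_min=220.0, n_bins=36, bins_per_octave=12):
    """Magnitudes of one constant-Q frame taken from the start of x.

    Illustrative only: a direct (slow) evaluation with one kernel per bin.
    All parameter defaults are assumptions, not values from the paper.
    """
    # Constant ratio of center frequency to bandwidth for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Geometrically spaced center frequencies (one bin per semitone here).
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    mags = np.empty(n_bins)
    for k, f_k in enumerate(freqs):
        # Window length shrinks as frequency grows, keeping Q constant.
        n_k = int(np.ceil(q * sr / f_k))
        n = np.arange(n_k)
        window = np.hanning(n_k)
        kernel = window * np.exp(-2j * np.pi * f_k * n / sr)
        mags[k] = np.abs(np.dot(x[:n_k], kernel)) / window.sum()
    return freqs, mags


# Hypothetical input: a lead note (A4) plus a residual harmony note (E5).
sr = 8000
t = np.arange(sr) / sr
lead = np.sin(2 * np.pi * 440.0 * t)
harmony = np.sin(2 * np.pi * 659.26 * t)

freqs, mags = naive_cqt_frame(lead + harmony, sr)
top_two = sorted(int(i) for i in np.argsort(mags)[-2:])
print(top_two, freqs[top_two])
```

In this toy setup the two strongest bins should correspond to A4 and E5, so both notes remain visible in the same frame, whereas an F0 extractor would report at most one pitch value for it. A practical system would use an optimized CQT implementation rather than this per-bin loop.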