Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

📅 2026-05-12

🤖 AI Summary
This work addresses a limitation of existing singing voice conversion (SVC) methods: they rely on F0 extractors that expect clean vocals, yet vocals separated from accompanied recordings still contain residual harmonies. To overcome this, the paper proposes Poly-SVC, a zero-shot, cross-lingual SVC system that explicitly models both the lead melody and the residual harmony, which lets it process polyphonic audio. The architecture combines a CQT-based pitch extractor, a random sampler that reduces interference information from the CQT, and a diffusion decoder based on conditional flow matching that fuses pitch, content, and timbre features. Experiments show that Poly-SVC outperforms the baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
📝 Abstract
Singing Voice Conversion (SVC) aims to transform a source singing voice into that of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor that preserves both the lead melody and the residual harmony; a random sampler that reduces interference information from the CQT; and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
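To make the motivation for a CQT-based pitch representation concrete, the following is a minimal sketch of why a constant-Q spectrum can keep a residual harmony that a single-F0 track would discard. It is not the paper's extractor: the naive single-frame CQT, the 8 kHz sample rate, the two test tones (a lead at A3 and a quieter harmony at C#4), and the helper names `cqt_frame` and `pick_peaks` are all assumptions chosen for illustration.

```python
import cmath
import math


def cqt_frame(signal, sample_rate, f_min, n_bins, bins_per_octave=12):
    """Magnitudes of a naive single-frame constant-Q transform.

    Bin k is centred on f_min * 2**(k / bins_per_octave). The analysis
    window shrinks as frequency rises so every bin spans the same number
    of cycles (the constant Q).
    """
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
    mags = []
    for k in range(n_bins):
        f_k = f_min * 2 ** (k / bins_per_octave)
        n_k = int(math.ceil(q * sample_rate / f_k))
        if n_k > len(signal):
            raise ValueError("signal too short for the lowest bin")
        acc = 0j
        for n in range(n_k):
            hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (n_k - 1))
            acc += signal[n] * hann * cmath.exp(-2j * math.pi * f_k * n / sample_rate)
        mags.append(abs(acc) / n_k)
    return mags


def pick_peaks(mags, rel_threshold=0.3):
    """Indices of local maxima above a fraction of the global maximum."""
    floor = rel_threshold * max(mags)
    return [
        k
        for k in range(1, len(mags) - 1)
        if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] > mags[k + 1]
    ]


# Illustrative input (assumed values): lead melody at A3 plus a quieter
# residual harmony a major third above it.
sr = 8000
lead_hz = 220.0
harmony_hz = 220.0 * 2 ** (4 / 12)
signal = [
    math.sin(2 * math.pi * lead_hz * n / sr)
    + 0.5 * math.sin(2 * math.pi * harmony_hz * n / sr)
    for n in range(2048)
]

# Two octaves of semitone bins starting at A2 (110 Hz).
mags = cqt_frame(signal, sr, f_min=110.0, n_bins=24)
peaks = pick_peaks(mags)
strongest = max(range(len(mags)), key=mags.__getitem__)
print("CQT peaks (bin indices):", peaks)
print("single strongest bin:", strongest)
```

Taking only the strongest bin mimics a single-F0 view and keeps the lead alone, whereas the peak list should also contain the bin of the quieter harmony. A practical system would compute a framewise CQT with an audio library rather than this quadratic-time loop.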
Problem

Research questions and friction points this paper is trying to address.

Singing Voice Conversion
Polyphony
Harmonic Modeling
Accompanied Recordings
F0 Extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polyphony-Aware
Zero-shot SVC
Constant-Q Transform
Conditional Flow Matching
Harmonic Modeling
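As a rough illustration of the Conditional Flow Matching idea listed above, the sketch below builds one training pair and then integrates a velocity field with Euler steps. The straight-line path between noise and data is one common CFM choice and is an assumption here, as are the function names, the three-value "feature frame", and the oracle velocity that stands in for a trained network. The summary does not specify the paper's path, network, or how pitch, content, and timbre conditions are injected, so none of that is modelled.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Point x_t on the straight path from noise x0 to data x1, and the
    constant target velocity x1 - x0 along that path."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.2, -0.5, 1.0]                          # stand-in for one target feature frame
x0 = [random.gauss(0.0, 1.0) for _ in x1]      # noise sample

# Training side: the network would be regressed onto u_t at the point x_t.
t = random.random()
x_t, u_t = cfm_training_pair(x0, x1, t)


# Oracle velocity field (hypothetical stand-in for a trained decoder).
def oracle_velocity(x, t):
    return [b - a for a, b in zip(x0, x1)]


loss = cfm_loss(oracle_velocity(x_t, t), u_t)
generated = euler_sample(oracle_velocity, x0, steps=10)
print("loss of oracle field:", loss)
print("generated frame:", generated)
```

With a perfect velocity field the Euler integration carries the noise sample onto the target frame. In a real decoder the velocity would come from a neural network that also receives the conditioning features.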
🔎 Similar Papers
Chen Geng
Stanford University
4D Vision, Computer Graphics, Inverse Graphics
Meng Chen
Lyra Lab, Tencent Music Entertainment, Shenzhen, China
Ruohua Zhou
School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture, Beijing, China
Ruolan Liu
Weifeng Zhao
Lyra Lab, Tencent Music Entertainment, Shenzhen, China