HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot singing voice conversion (SVC) methods that model speaker timbre and vocal content separately lose acoustic information and incur high computational overhead, which is especially costly in low-resource settings. This paper proposes HQ-SVC, an efficient framework that models content and timbre jointly. It uses a decoupled codec to extract content and speaker features together, adds pitch and volume modeling to preserve the expressive detail of the vocal performance, and progressively refines the output with differentiable signal processing and diffusion. The same framework natively supports voice super-resolution. Experiments show that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency, and achieves better voice naturalness than specialized audio super-resolution methods.

📝 Abstract
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre into that of an unseen target speaker while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, which loses essential acoustic information, degrades output quality, and requires significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC natively supports voice super-resolution tasks and achieves superior voice naturalness compared to specialized audio super-resolution methods.
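The abstract does not say how pitch and volume are obtained, so the following is only a minimal sketch of what frame-level pitch and volume features can look like. It assumes an autocorrelation pitch estimate and root-mean-square (RMS) energy as the volume proxy; the helper names, frame sizes, and frequency range are all hypothetical and are not taken from HQ-SVC.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_volume(frames):
    """RMS energy per frame, used here as a simple volume proxy."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def frame_pitch(frames, sr, fmin=80.0, fmax=1000.0):
    """Estimate f0 per frame from the autocorrelation peak (assumed method)."""
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        # Keep non-negative lags only; ac[0] is the frame energy.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            continue  # silent frame: leave f0 at 0
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        f0[i] = sr / lag
    return f0

# Usage on a synthetic 220 Hz tone (illustrative input, not real singing).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
frames = frame_signal(tone, frame_len=1024, hop=256)
f0 = frame_pitch(frames, sr)
volume = frame_volume(frames)
```

Because the lag is an integer number of samples, the estimate is quantized and lands near, rather than exactly on, 220 Hz. Practical systems typically use a dedicated pitch tracker instead of raw autocorrelation.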
Problem

Research questions and friction points this paper is trying to address.

Achieving high-quality zero-shot singing voice conversion with limited data resources
Overcoming acoustic information loss from separate speaker-content modeling approaches
Enhancing output fidelity while maintaining computational efficiency in voice conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly extracts content and speaker features using a decoupled codec
Enhances fidelity through pitch and volume modeling
Refines outputs via differentiable signal processing and diffusion
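As a rough illustration of the signal-processing side of the last point, one common building block in differentiable signal processing is a harmonic additive synthesizer driven by frame-level pitch and amplitude. The sketch below shows only its forward computation in NumPy, so it is not itself differentiable. The page does not state that HQ-SVC uses this exact form, and the function name, harmonic count, and 1/k harmonic weights are assumptions.

```python
import numpy as np

def harmonic_synth(f0, amp, sr, hop, n_harmonics=8):
    """Additive synthesis from frame-level f0 (Hz) and amplitude.

    All names and defaults here are illustrative assumptions.
    """
    n_samples = len(f0) * hop
    frame_pos = np.arange(len(f0)) * hop
    sample_pos = np.arange(n_samples)
    # Upsample frame-level controls to the audio rate by linear interpolation.
    f0_up = np.interp(sample_pos, frame_pos, f0)
    amp_up = np.interp(sample_pos, frame_pos, amp)
    # Integrate instantaneous frequency to get the fundamental's phase.
    phase = 2.0 * np.pi * np.cumsum(f0_up) / sr
    audio = np.zeros(n_samples)
    for k in range(1, n_harmonics + 1):
        # Mute harmonics at or above Nyquist to avoid aliasing.
        audible = (k * f0_up) < (sr / 2)
        audio += audible * np.sin(k * phase) / k
    # Normalize the 1/k weights so that amp bounds the peak level.
    weight_sum = sum(1.0 / k for k in range(1, n_harmonics + 1))
    return amp_up * audio / weight_sum

# Usage: a 220 Hz tone that fades in over 50 frames.
sr, hop = 16000, 256
f0 = np.full(50, 220.0)
amp = np.linspace(0.0, 0.8, 50)
audio = harmonic_synth(f0, amp, sr, hop)
```

In a differentiable setting the same arithmetic would be written in an autodiff framework so that gradients can flow back to the predicted pitch and amplitude controls.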
Bingsong Bai
Beijing University of Posts and Telecommunications
Text-to-speech, Voice Conversion, Speech Processing
Yizhong Geng
Beijing University of Posts and Telecommunications
TTS, VC, Multimodal
Fengping Wang
Beijing University of Posts and Telecommunications, China
Cong Wang
Beijing University of Posts and Telecommunications, China
Puyuan Guo
Beijing University of Posts and Telecommunications, China
Yingming Gao
Beijing University of Posts and Telecommunications
Computer Assisted Language Learning, Acoustic Phonetics and Speech Synthesis
Ya Li
Beijing University of Posts and Telecommunications, China