🤖 AI Summary
To address the acoustic information loss and high computational overhead caused by voice-content disentanglement in zero-shot singing voice conversion (SVC) under low-resource settings, this paper proposes an efficient joint modeling framework for content and timbre. Methodologically, we design a coupled encoder-decoder architecture that simultaneously extracts linguistic content and singer-specific timbral features; incorporate pitch- and loudness-aware modeling to preserve expressive vocal performance; and integrate conditional waveform generation with diffusion-based modeling for end-to-end high-fidelity reconstruction, introducing embedded voice super-resolution capability to zero-shot SVC for the first time. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in conversion quality, naturalness, and inference efficiency, with particularly marked improvements in audio fidelity under data-scarce conditions, while also achieving competitive super-resolution performance.
📝 Abstract
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre into an unseen target speaker's voice while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information, which degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in both conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
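The pitch and volume modeling described above conditions the decoder on frame-level acoustic trajectories. As a minimal illustration of the kind of features such conditioning typically uses (not the paper's actual implementation; the function names and frame parameters here are assumptions), the sketch below extracts a crude autocorrelation-based F0 contour and an RMS loudness contour from a waveform with NumPy:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Slice the waveform into overlapping analysis frames.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_volume(frames):
    # Frame-level loudness: root-mean-square energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))

def f0_autocorr(frames, sr, fmin=50.0, fmax=1000.0):
    # Crude pitch tracker: pick the autocorrelation peak within
    # the plausible lag range [sr/fmax, sr/fmin]. Unvoiced frames
    # (near-zero energy) are left at 0 Hz.
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, fr in enumerate(frames):
        fr = fr - fr.mean()
        ac = np.correlate(fr, fr, mode="full")[len(fr) - 1:]
        if ac[0] <= 1e-8:
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0[i] = sr / lag
    return f0

# Demo on a synthetic 220 Hz tone (stand-in for a sung note).
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)
frames = frame_signal(wave, frame_len=1024, hop=256)
pitch = f0_autocorr(frames, sr)   # ~220 Hz per frame
vol = rms_volume(frames)          # ~0.354 (RMS of a 0.5-amplitude sine)
```

In a real SVC pipeline these contours would be computed with a robust pitch estimator and fed to the decoder alongside content and speaker embeddings; the point here is only that pitch and volume are cheap, speaker-independent signals worth preserving.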