🤖 AI Summary
To address the acoustic information loss and high computational overhead caused by voice-content disentanglement in zero-shot singing voice conversion (SVC) under low-resource settings, this paper proposes an efficient joint modeling framework for content and timbre. Methodologically, we design a coupled encoder-decoder architecture that simultaneously extracts linguistic content and singer-specific timbral features; incorporate pitch- and loudness-aware modeling to preserve expressive vocal performance; and integrate conditional waveform generation with diffusion-based modeling for end-to-end high-fidelity reconstruction, introducing embedded voice super-resolution capability to zero-shot SVC for the first time. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in conversion quality, naturalness, and inference efficiency, with particularly marked improvements in audio fidelity under data-scarce conditions, while also achieving competitive super-resolution performance.
📝 Abstract
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre into an unseen target speaker's voice while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information, which degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in both conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
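The pitch and volume modeling described above conditions the decoder on frame-level acoustic trajectories. As a minimal illustration of the kind of features such conditioning typically uses (not the paper's actual implementation; the function names and frame parameters here are assumptions), the sketch below extracts a crude autocorrelation-based F0 contour and an RMS loudness contour from a waveform with NumPy:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Slice the waveform into overlapping analysis frames.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_volume(frames):
    # Frame-level loudness: root-mean-square energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))

def f0_autocorr(frames, sr, fmin=50.0, fmax=1000.0):
    # Crude pitch tracker: pick the autocorrelation peak within
    # the plausible lag range [sr/fmax, sr/fmin]. Unvoiced frames
    # (near-zero energy) are left at 0 Hz.
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, fr in enumerate(frames):
        fr = fr - fr.mean()
        ac = np.correlate(fr, fr, mode="full")[len(fr) - 1:]
        if ac[0] <= 1e-8:
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0[i] = sr / lag
    return f0

# Demo on a synthetic 220 Hz tone (stand-in for a sung note).
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)
frames = frame_signal(wave, frame_len=1024, hop=256)
pitch = f0_autocorr(frames, sr)   # ~220 Hz per frame
vol = rms_volume(frames)          # ~0.354 (RMS of a 0.5-amplitude sine)
```

In a real SVC pipeline these contours would be computed with a robust pitch estimator and fed to the decoder alongside content and speaker embeddings; the point here is only that pitch and volume are cheap, speaker-independent signals worth preserving.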