🤖 AI Summary
Addressing key challenges in singing voice synthesis (SVS) and singing voice conversion (SVC), including the difficulty of cross-domain modeling, insufficient musicality, and the scarcity of high-quality annotated data, this paper proposes the first voice-reference-driven zero-shot unified framework for SVS/SVC. Methodologically, it employs a pretrained content encoder to extract shared phonetic–singing representations, integrates a diffusion-based generative model trained jointly on hybrid singing/speech data, and introduces a multi-condition controllable decoding mechanism. This enables fully controllable generation of lyrics, pitch, style, and timbre from a single spoken utterance. Experiments demonstrate significant improvements over state-of-the-art methods in timbre similarity and musicality; notably, the framework achieves high-fidelity singing voice cloning under zero-shot conditions. By eliminating reliance on target-domain singing data or parallel annotations, it establishes a novel paradigm for low-resource music generation.
📝 Abstract
We propose a unified framework for Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC), addressing the limitations of existing approaches: the difficulty of cross-domain SVS/SVC, poor output musicality, and the scarcity of singing data. Our framework enables control over multiple aspects of generation: language content based on lyrics, performance attributes based on a musical score, singing style and vocal techniques based on a selector, and voice identity based on a speech sample. The proposed zero-shot learning paradigm consists of one SVS model and two SVC models, utilizing pre-trained content embeddings and a diffusion-based generator. The framework is trained on mixed datasets comprising both singing and speech audio, enabling singing voice cloning from a speech reference. Experiments show substantial improvements in timbre similarity and musicality over state-of-the-art baselines, and offer insights into other low-data music tasks such as instrumental style transfer. Examples can be found at: everyone-can-sing.github.io.