🤖 AI Summary
This paper addresses out-of-domain (OOD) zero-shot singing voice style transfer: transferring singing styles that encompass timbre, emotion, pronunciation, and articulation from unseen reference vocals. The authors propose StyleSinger, the first end-to-end framework for this task, featuring: (i) a Residual Style Adaptor (RSA) that uses a residual quantization module to capture diverse style characteristics; (ii) Uncertainty Modeling Layer Normalization (UMLN), which perturbs style attributes within the content representation during training to improve cross-domain generalization; and (iii) an integrated design combining reference-driven disentangled style encoding with neural acoustic modeling. Experiments demonstrate substantial improvements over state-of-the-art baselines in zero-shot transfer: MOS scores increase by over 1.2, cosine similarity for style fidelity improves by 23%, and both audio naturalness and expressiveness are significantly enhanced. Crucially, the approach removes the reliance on target vocal attributes being observed during training, a fundamental limitation of conventional singing voice synthesis (SVS) systems.
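The RSA's residual quantization can be pictured as a stack of codebooks, each one quantizing the error left over by the previous layer, so coarse style traits land in early codebooks and finer nuances in later ones. Below is a minimal NumPy sketch of that greedy lookup; the codebook sizes and the `residual_quantize` helper are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_quantize(style, codebooks):
    """Greedy residual vector quantization (illustrative sketch).

    Each codebook layer picks the codeword nearest to the residual
    left by all previous layers; the sum of chosen codewords is the
    quantized style embedding.
    """
    residual = style.copy()
    codes = []
    quantized = np.zeros_like(style)
    for cb in codebooks:            # cb: (K, D) codebook of K entries
        # nearest codeword to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized += cb[idx]        # accumulate the approximation
        residual -= cb[idx]         # pass the leftover to the next layer
    return codes, quantized

# Toy usage: a 4-dim style vector quantized by three 8-entry codebooks.
rng = np.random.default_rng(0)
style = rng.normal(size=4)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
codes, approx = residual_quantize(style, codebooks)
```

Each extra codebook layer refines the approximation of the style vector, which is why a residual scheme can represent multi-level style detail with small codebooks.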
📝 Abstract
Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, modeling the intricate nuances of singing voice style is difficult, as singing voices are remarkably expressive. Moreover, existing SVS methods suffer a decline in synthesis quality in OOD scenarios, as they rely on the assumption that the target vocal attributes are observable during training. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches: 1) the Residual Style Adaptor (RSA), which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN), which perturbs the style attributes within the content representation during training and thus improves model generalization. Our extensive evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Singing voice samples are available at https://stylesinger.github.io/.
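UMLN's core idea, perturbing style statistics during training so the content branch cannot memorize the styles it has seen, can be sketched as a layer normalization whose style-conditioned scale and shift receive Gaussian noise scaled by their batch-level variation. The tensor shapes, the noise model, and the `umln` helper below are assumptions for illustration, not the authors' code:

```python
import numpy as np

def umln(x, style_gamma, style_beta, training=True, eps=1e-5, rng=None):
    """Uncertainty-modeled style modulation (illustrative sketch).

    Layer-normalize x over the feature axis, then modulate with a
    style-derived scale/shift. During training, the scale and shift
    are perturbed with Gaussian noise whose magnitude follows their
    variation across the batch, discouraging overfitting to seen
    styles.  x, style_gamma, style_beta: arrays of shape (B, D).
    """
    rng = rng if rng is not None else np.random.default_rng()
    # standard layer normalization over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    h = (x - mu) / (sigma + eps)
    gamma, beta = style_gamma, style_beta
    if training:
        # uncertainty scales estimated from batch-level variation
        sig_g = style_gamma.std(axis=0, keepdims=True)
        sig_b = style_beta.std(axis=0, keepdims=True)
        gamma = gamma + rng.standard_normal(gamma.shape) * sig_g
        beta = beta + rng.standard_normal(beta.shape) * sig_b
    return gamma * h + beta
```

At inference time (`training=False`) the perturbation is disabled and the layer reduces to ordinary style-modulated layer normalization, so the randomness only regularizes training.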