🤖 AI Summary
To address the challenge of high-quality, multi-style singing voice synthesis (SVS) under zero-shot conditions, this paper proposes the first cross-lingual and cross-domain zero-shot SVS framework. Methodologically, we design a clustering-based style encoder to achieve a disentangled style representation in latent space; construct a joint language model over style and phoneme duration to enhance temporal consistency; and introduce a style adaptive decoder with mel-style adaptive normalization to improve acoustic modeling fidelity. Key technical contributions include clustering vector quantization, mel-style adaptive normalization, and fine-grained style control over vocal techniques, emotion, rhythm, articulation, and timbre. Experiments demonstrate that our method significantly outperforms existing baselines on zero-shot style transfer, multi-level controllable synthesis, cross-lingual SVS, and speech-to-singing conversion, achieving state-of-the-art performance in synthesis quality, singer similarity, and style controllability.
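The clustering vector quantization idea in the summary can be illustrated with a minimal sketch: each continuous style embedding is snapped to its nearest learned cluster centroid, condensing style information into a compact discrete latent space. This is a hedged illustration only; the function name `quantize_styles`, the array shapes, and the toy codebook are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def quantize_styles(frames, codebook):
    """Assign each style frame to its nearest codebook centroid (L2 distance).

    frames:   (T, D) array of per-frame style embeddings (hypothetical shapes)
    codebook: (K, D) array of learned cluster centroids
    Returns (indices, quantized), where quantized[t] = codebook[indices[t]].
    """
    # Pairwise squared distances between frames and centroids: shape (T, K)
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)            # nearest-centroid assignment per frame
    return idx, codebook[idx]          # discrete style tokens + their embeddings

# Toy example: 4 style frames, 2-D embeddings, 3 centroids
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [0.05, 0.0]])
idx, quantized = quantize_styles(frames, codebook)
```

In a trained model the codebook would be learned jointly with the encoder (e.g. via a VQ commitment loss); here it is fixed only to show the nearest-centroid lookup.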
📝 Abstract
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
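The mel-style adaptive normalization in the style adaptive decoder can be sketched in the spirit of adaptive instance normalization: mel features are normalized per channel, then re-modulated with a scale and shift derived from the style representation. This is a simplified assumption-laden sketch; the function name `mel_style_adaptive_norm` and the layout of the style vector as concatenated `[gamma, beta]` are hypothetical, standing in for whatever learned projection the model actually uses.

```python
import numpy as np

def mel_style_adaptive_norm(mel, style, eps=1e-5):
    """Normalize mel features per channel over time, then modulate them with
    a style-derived scale (gamma) and shift (beta) -- an AdaIN-like sketch.

    mel:   (T, C) mel-spectrogram frames
    style: (2*C,) vector holding [gamma, beta] (hypothetical projection output)
    """
    C = mel.shape[1]
    gamma, beta = style[:C], style[C:]
    mu = mel.mean(axis=0, keepdims=True)      # per-channel mean over time
    sigma = mel.std(axis=0, keepdims=True)    # per-channel std over time
    normed = (mel - mu) / (sigma + eps)       # strip instance statistics
    return gamma * normed + beta              # re-inject style statistics

# Toy usage: 100 frames, 2 mel channels, style requests std [2, 0.5], mean [1, -1]
mel = np.random.default_rng(0).normal(size=(100, 2))
style = np.array([2.0, 0.5, 1.0, -1.0])
out = mel_style_adaptive_norm(mel, style)
```

After modulation, each channel's mean and standard deviation track beta and gamma, which is how a style prompt can reshape the decoder's acoustic details.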