🤖 AI Summary
Current zero-shot multilingual singing voice synthesis (SVS) models rely heavily on precise phoneme and note boundary annotations, leading to unnatural cross-boundary transitions, poor zero-shot generalization, and limited support for fine-grained, hierarchical style control. To address these limitations, the authors propose the first customizable framework for zero-shot multilingual SVS, built from three components: (1) a blurred-boundary content encoder that reduces sensitivity to alignment errors; (2) a cross-modal contrastive audio encoder that strengthens semantic–acoustic alignment across singing, speech, and text prompts; and (3) a flow-based Custom Transformer with F0-supervised Cus-MOE, enabling multi-level style modeling driven jointly by text, pitch, rhythm, and emotion prompts. Objective and subjective evaluations show consistent improvements over state-of-the-art baselines in phoneme/note transition naturalness, cross-lingual zero-shot generalization, and prompt fidelity.
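The cross-modal contrastive audio encoder described above can be understood through a standard symmetric InfoNCE objective, where paired audio and text prompt embeddings are pulled together and mismatched pairs pushed apart. The sketch below is illustrative only (the function name, temperature value, and NumPy formulation are my assumptions, not the paper's implementation):

```python
import numpy as np

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text
    embeddings; matching pairs sit on the diagonal of the
    similarity matrix. Illustrative sketch, not the paper's code."""
    # L2-normalize so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(a))              # correct pair = diagonal entry

    def xent(l):
        # numerically stable cross-entropy toward the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With this loss, well-aligned audio/text pairs score much lower than shuffled (mismatched) pairs, which is the property the encoder exploits to align singing, speech, and textual prompts in one space.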
📝 Abstract
Customizable multilingual zero-shot singing voice synthesis (SVS) has potential applications in music composition and short-video dubbing. However, existing SVS models depend heavily on phoneme and note boundary annotations, which limits their robustness in zero-shot scenarios and produces poor transitions between phonemes and notes. They also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task, multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 comprises three key modules: 1) the Blurred Boundary Content (BBC) Encoder, which predicts durations, extends content embeddings, and masks the boundary regions to enable smooth transitions; 2) the Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts; and 3) the Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and the style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models on both subjective and objective metrics across multiple related tasks.
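The BBC Encoder's "extend content embedding and mask the boundaries" step can be sketched as follows: phoneme-level embeddings are repeated according to predicted durations to form a frame-level sequence, and a few frames on each side of every phoneme boundary are zeroed so the model cannot rely on hard boundary annotations. The function name, `mask_width` parameter, and shapes below are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

def expand_and_mask(phoneme_emb, durations, mask_width=1):
    """Expand phoneme-level embeddings (P, D) to frame level (T, D)
    using predicted per-phoneme durations, then zero out `mask_width`
    frames on each side of every phoneme boundary. Minimal sketch of
    the blurred-boundary idea, not TCSinger 2's implementation."""
    frames = np.repeat(phoneme_emb, durations, axis=0)   # (T, D) frame sequence
    boundaries = np.cumsum(durations)[:-1]               # frame index of each transition
    keep = np.ones(len(frames), dtype=bool)
    for b in boundaries:
        lo = max(0, b - mask_width)
        hi = min(len(frames), b + mask_width)
        keep[lo:hi] = False                              # blur this boundary region
    return frames * keep[:, None], keep
```

Because the masked frames carry no content information, the decoder must interpolate across boundary regions, which is what encourages the smooth phoneme/note transitions the abstract describes.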