🤖 AI Summary
This work proposes the first industrial-grade, highly robust zero-shot singing voice synthesis (SVS) system, addressing the prevalent limitations of existing open-source SVS models, namely poor robustness and inadequate zero-shot generalization in real-world deployment. Trained on over 42,000 hours of multilingual vocal data, the system leverages MIDI or melodic representations as conditioning inputs to enable controllable, cross-lingual, and multi-style singing synthesis. To facilitate reliable evaluation of zero-shot SVS, the authors also introduce SoulX-Singer-Eval, a rigorously disentangled benchmark specifically designed for this purpose. The proposed system achieves state-of-the-art synthesis quality in Mandarin, English, and Cantonese, while demonstrating exceptional zero-shot generalization across diverse musical contexts.
📄 Abstract
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.