SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

📅 2026-02-08
🤖 AI Summary
This work presents SoulX-Singer, an open-source zero-shot singing voice synthesis (SVS) system built for industrial deployment. It targets two limitations of existing open-source SVS models in real-world use: poor robustness and inadequate zero-shot generalization. Trained on more than 42,000 hours of multilingual vocal data, the system takes MIDI scores or melodic representations as conditioning inputs, enabling controllable, cross-lingual, and multi-style singing synthesis. To support reliable evaluation of zero-shot SVS, the authors also introduce SoulX-Singer-Eval, a benchmark with strict separation between training and test data. The system achieves state-of-the-art synthesis quality in Mandarin, English, and Cantonese and generalizes in zero-shot settings across diverse musical conditions.

๐Ÿ“ Abstract
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
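To make the two conditioning types concrete, the sketch below shows how a symbolic score (a list of MIDI notes) can be expanded into a frame-level pitch contour, which is one simple form of melodic representation. It is only an illustration under stated assumptions and not the SoulX-Singer pipeline: the helper names, the `(pitch, duration)` note format, the use of `None` for rests, and the 100 frames-per-second rate are all hypothetical. The one fixed fact it relies on is the standard equal-temperament mapping, in which MIDI note 69 is A4 at 440 Hz.

```python
from typing import List, Optional, Tuple

A4_HZ = 440.0   # reference tuning for A4
A4_MIDI = 69    # MIDI note number of A4

def midi_to_hz(note: int) -> float:
    """Equal-temperament frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)

def notes_to_f0(
    notes: List[Tuple[Optional[int], float]],
    frame_rate: float = 100.0,  # assumed frames per second
) -> List[float]:
    """Expand (midi_pitch, duration_seconds) notes into a per-frame F0 list.

    A pitch of None marks a rest and is written as 0.0 Hz (unvoiced).
    """
    contour: List[float] = []
    for pitch, duration in notes:
        n_frames = round(duration * frame_rate)
        hz = 0.0 if pitch is None else midi_to_hz(pitch)
        contour.extend([hz] * n_frames)
    return contour

# Hypothetical three-event score: A4 for 0.5 s, a 0.25 s rest, then A5 for 0.5 s.
score = [(69, 0.5), (None, 0.25), (81, 0.5)]
f0 = notes_to_f0(score)
```

A contour built this way is piecewise constant. A melody extracted from a real recording would also carry vibrato and pitch glides, which is one reason a system might accept either kind of input.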
Problem

Research questions and friction points this paper is trying to address.

singing voice synthesis
zero-shot
robustness
generalization
industrial deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot singing voice synthesis
controllable singing generation
multi-lingual SVS
MIDI-conditioned synthesis
zero-shot evaluation benchmark

Jiale Qian
Soul AI Lab, China

Hao Meng
Soul AI Lab, China

Tian Zheng
Soul AI Lab, China

Pengcheng Zhu
Fuxi AI Lab, NetEase Inc.
speech synthesis, singing voice synthesis, talking avatar, voice conversion

Haopeng Lin
Soul AI Lab, China

Yuhang Dai
Soul AI Lab, China

Hanke Xie
Northwestern Polytechnical University
Audio speech synthesis

Wenxiao Cao
Soul AI Lab, China

Ruixuan Shang
Soul AI Lab, China

Jun Wu
Soul AI Lab, China

Hongmei Liu
Soul AI Lab, China

Hanlin Wen
Soul AI Lab, China

Jian Zhao
AI Center, Geely Automobile Research Institute (Ningbo) Co., Ltd., Ningbo, China

Zhonglin Jiang
AI Center, Geely Automobile Research Institute (Ningbo) Co., Ltd., Ningbo, China

Yong Chen
AI Center, Geely Automobile Research Institute (Ningbo) Co., Ltd., Ningbo, China

Shunshun Yin
Soul AI Lab, China

Ming Tao
Soul AI Lab, China

Jianguo Wei
Tianjin University
Speech Production, Speech Processing, Artificial medical intelligence

Lei Xie
Northwestern Polytechnical University
speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence

Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
speech synthesis, singing voice synthesis, voice conversion