DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data scarcity and limited model scalability hinder the development of diffusion-based singing voice synthesis. To address these challenges, this work proposes a two-stage solution. First, a high-quality Chinese singing dataset exceeding 500 hours is constructed by leveraging large language models (LLMs) to generate diverse lyrics aligned with fixed melodies. Second, DiTSinger, a novel diffusion-based singing synthesizer, is introduced; it integrates a Diffusion Transformer architecture with an implicit alignment mechanism that eliminates reliance on phoneme-level duration annotations, instead constraining phoneme-to-acoustic attention within character-level spans to improve alignment robustness. The model's depth, width, and feature resolution are also systematically scaled to improve representational capacity. Experiments demonstrate stable training and high-fidelity synthesis even without precise alignment labels, with significant improvements over existing diffusion-based methods in scalability, robustness, and audio quality.
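The implicit alignment idea described above can be pictured as an attention mask: each phoneme query is only allowed to attend to acoustic frames that fall inside the span of its parent character. The sketch below is an illustration of that constraint, not the paper's actual implementation; the function name, data layout, and toy spans are all assumptions.

```python
import numpy as np

def char_span_attention_mask(phoneme_chars, char_frame_spans, n_frames):
    """Build a boolean mask of shape (n_phonemes, n_frames).

    mask[i, t] is True iff phoneme i may attend to acoustic frame t,
    i.e. frame t lies inside the span of phoneme i's parent character.

    phoneme_chars[i]     -- index of the character phoneme i belongs to
    char_frame_spans[c]  -- (start_frame, end_frame) for character c,
                            end exclusive
    """
    n_ph = len(phoneme_chars)
    mask = np.zeros((n_ph, n_frames), dtype=bool)
    for i, c in enumerate(phoneme_chars):
        start, end = char_frame_spans[c]
        mask[i, start:end] = True
    return mask

# Toy example: 3 phonemes over 2 characters, 8 acoustic frames.
mask = char_span_attention_mask(
    phoneme_chars=[0, 0, 1],
    char_frame_spans={0: (0, 5), 1: (5, 8)},
    n_frames=8,
)
```

In an attention layer, positions where the mask is False would have their logits set to negative infinity before the softmax, so only character-level (not phoneme-level) boundaries are needed.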

📝 Abstract
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
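The abstract mentions a Diffusion Transformer with RoPE and qk-norm. As a rough, hedged sketch of what that attention variant looks like (queries and keys are L2-normalized, rotary position embeddings are applied, and a fixed temperature stands in for the learnable scale typically used in practice; none of this is taken from the paper's code):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def qk_norm_attention(q, k, v, eps=1e-6):
    """Single-head attention with qk-norm and RoPE (illustrative only)."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    q, k = rope(q), rope(k)                  # rotation preserves the unit norm
    scores = (q @ k.T) * np.sqrt(q.shape[-1])  # fixed scale; learnable in practice
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Normalizing q and k bounds the attention logits, which is one common motivation for qk-norm: it tends to stabilize training as model width and depth are scaled up, consistent with the scaling story in the abstract.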
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in singing voice synthesis through scalable generation
Enhances model scalability using diffusion transformer with systematic scaling
Eliminates dependency on phoneme duration labels via implicit alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer systematically scaled in depth, width, and resolution
Implicit alignment removes the need for phoneme-level duration labels
Two-stage pipeline synthesizes a large-scale singing dataset
Zongcai Du
Migu Music, China Mobile Communications Corporation, China
Guilin Deng
Migu Music, China Mobile Communications Corporation, China
Xiaofeng Guo
PhD student, Robotics Institute, Carnegie Mellon University (robotics, mobile manipulation, tactile sensing, learning and control)
Xin Gao
Migu Music, China Mobile Communications Corporation, China
Linke Li
Migu Music, China Mobile Communications Corporation, China
Kaichang Cheng
Migu Music, China Mobile Communications Corporation, China
Fubo Han
Migu Music, China Mobile Communications Corporation, China
Siyu Yang
Migu Music, China Mobile Communications Corporation, China
Peng Liu
Migu Music, China Mobile Communications Corporation, China
Pan Zhong
Migu Music, China Mobile Communications Corporation, China
Qiang Fu
Migu Music, China Mobile Communications Corporation, China