YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lyric editing methods suffer from insufficient controllability in preserving melodic consistency or rely heavily on manual alignment. This work proposes YingMusic-Singer, a fully diffusion-based model that enables melody-controllable singing voice synthesis without manual alignment, using only an optional timbre reference, the original vocal snippet, and the edited lyrics. The method is the first to support flexible lyric editing while precisely retaining the original melody. The authors also introduce LyricEditBench, the first benchmark specifically designed for evaluating melody-preserving lyric editing. By combining a full diffusion architecture with curriculum learning and Group Relative Policy Optimization, the approach outperforms the strongest comparable baseline, Vevo2, in both melodic fidelity and lyrical alignment. Code, models, the evaluation benchmark, and audio samples are publicly released.

📝 Abstract
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.
Problem

Research questions and friction points this paper is trying to address.

singing voice synthesis
lyric manipulation
melody preservation
controllable generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based singing voice synthesis
melody-preserving lyric editing
annotation-free melody guidance
curriculum learning
LyricEditBench
Chunbo Hao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
Junjie Zheng
AI Lab, GiantNetwork, China
Guobin Ma
Northwestern Polytechnical University
Yuepeng Jiang
Northwestern Polytechnical University
Speech Processing, Speech Synthesis, Voice Conversion
Huakang Chen
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
Wenjie Tian
Northwestern Polytechnical University
speech generation
Gongyu Chen
AI Lab, GiantNetwork, China
Zihao Chen
AI Lab, GiantNetwork, China
Lei Xie
Northwestern Polytechnical University
speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence