🤖 AI Summary
Existing lyric editing methods either offer insufficient control over melodic consistency or rely heavily on manual alignment. This work proposes a fully diffusion-based model that enables melody-controllable singing voice synthesis without manual alignment, using only an optional timbre reference, the original vocal snippet, and the edited lyrics. The method supports flexible lyric editing while precisely retaining the original melody. The authors also introduce LyricEditBench, the first benchmark specifically designed for evaluating melody-preserving lyric editing. By combining the fully diffusion-based architecture with curriculum learning and Group Relative Policy Optimization, the approach outperforms Vevo2, the strongest comparable baseline, in both melodic fidelity and lyric adherence. Code, model weights, the evaluation benchmark, and audio samples are publicly released.
📝 Abstract
Regenerating singing voices with altered lyrics while preserving melodic consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs, with no manual alignment required: an optional timbre reference, a melody-providing singing clip, and the modified lyrics. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for evaluating melody-preserving lyric modification. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.