🤖 AI Summary
Machine-generated lyrics often suffer from poor singability because their format does not match the melody: line counts are inaccurate and syllable counts per line are inconsistent with the music. To address this, we propose a melody-to-lyric generation framework that jointly learns wording and formatting, using a training objective informed by musicological research on the relationship between melody and lyrics so that the model learns the melody's fine-grained format requirements. After general-domain pretraining, training proceeds in two stages: the model first acquires length awareness from a large text-only lyric corpus, then is trained on melody-to-lyric data with the new format objective. Compared to naive fine-tuning, our approach achieves absolute gains of 3.75% and 21.44% in number-of-line and syllable-per-line accuracy, respectively, without sacrificing text fluency. In the subjective evaluation, it achieves relative improvements of 63.92% and 74.18% in music-lyric compatibility and overall quality over the state-of-the-art melody-to-lyric generation model, narrowing the compatibility gap between generated lyrics and melodies.
📝 Abstract
Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, which negatively impacts the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training. After general-domain pretraining, our proposed model first acquires length awareness from a large text-only lyric corpus. Then, during melody-to-lyric training, we introduce a new objective informed by musicological research on the relationship between melody and lyrics, which enables the model to learn the fine-grained format requirements of the melody. Compared to naive fine-tuning, our model achieves 3.75% and 21.44% absolute accuracy gains in meeting the number-of-line and syllable-per-line requirements, without sacrificing text fluency. Furthermore, in the subjective evaluation, our model demonstrates 63.92% and 74.18% relative improvements in music-lyric compatibility and overall quality over the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
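To make the two format metrics concrete, the following is a minimal Python sketch of how number-of-line and syllable-per-line accuracy could be checked for a single generated lyric. It is an illustration under stated assumptions, not the paper's evaluation code: the function names (`count_syllables`, `line_syllables`, `format_accuracy`), the vowel-group syllable heuristic, and the idea that the melody supplies one target syllable count per line are all hypothetical choices made here.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (a rough heuristic,
    assumed here for illustration; not the paper's syllable counter)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def line_syllables(line: str) -> int:
    """Sum the approximate syllable counts of all words in one lyric line."""
    return sum(count_syllables(w) for w in re.findall(r"[A-Za-z']+", line))


def format_accuracy(lyrics: list[str], target_syllables: list[int]) -> tuple[bool, float]:
    """Return (line_count_matches, fraction_of_target_lines_with_matching_syllables).

    `target_syllables` is assumed to hold one required syllable count per
    melody line; missing generated lines count as misses.
    """
    line_count_ok = len(lyrics) == len(target_syllables)
    if not target_syllables:
        return line_count_ok, 0.0
    hits = sum(
        line_syllables(line) == target
        for line, target in zip(lyrics, target_syllables)
    )
    return line_count_ok, hits / len(target_syllables)


example = format_accuracy(["hello world", "sing a song"], [3, 4])
print(example)
```

Averaging the two returned values over a test set would give corpus-level accuracies of the kind the abstract reports; an "absolute gain" is then the difference between two such accuracies in percentage points, as opposed to the relative improvements reported for the subjective scores.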