LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

📅 2023-07-05
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
Machine-generated lyrics often suffer from poor singability due to misaligned rhythmic phrasing, inaccurate line counts, and inconsistent syllable numbers per line. To address this, we propose a melody-to-lyric joint generation framework that introduces, for the first time, a format-aware training objective—explicitly modeling musicological constraints (e.g., meter, phrase structure) as fine-grained melody–lyric alignment penalties. Our method employs a two-stage training strategy: it first infuses length and rhythm awareness into large language models using a text-only lyric corpus, then optimizes with a music-driven format alignment loss. Evaluated on standard benchmarks, our approach achieves absolute improvements of 3.75% and 21.44% in line-count and syllable-per-line accuracy, respectively. In human evaluations, it outperforms the state-of-the-art method by 63.92% and 74.18% (relative) on melody–lyric compatibility and overall quality, significantly narrowing the singability gap between human-composed and AI-generated lyrics.
📝 Abstract
Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training. After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
Problem

Research questions and friction points this paper is trying to address.

Generated lyrics often fail to match the melody's format, leaving a compatibility (singability) gap between lyrics and music.
Models trained by naive fine-tuning adhere poorly to the melody's required number of lines and syllables per line.
Format constraints must be enforced without sacrificing the fluency of the generated text.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly learns wording and formatting for melody-to-lyric generation.
Uses self-supervised pretraining on large lyric corpus for length awareness.
Incorporates musicological auxiliary objectives to capture prosodic patterns.
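The line-count and syllable-per-line requirements described above can be illustrated with a small format checker. This is a minimal sketch, not the paper's training objective: `naive_syllables` and `format_accuracy` are hypothetical helpers, and the vowel-group syllable heuristic stands in for a proper syllabifier (e.g., a pronunciation-dictionary lookup).

```python
import re

def naive_syllables(word: str) -> int:
    # Rough heuristic: count contiguous vowel groups as syllables.
    # A placeholder assumption, not the syllabification used in the paper.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def format_accuracy(lyrics: str, target_syllables: list[int]):
    """Check generated lyrics against per-line syllable targets derived
    from the melody (one syllable per note, per the singability premise).
    Returns (line-count match, fraction of lines hitting their target)."""
    lines = [l for l in lyrics.strip().splitlines() if l.strip()]
    line_count_ok = len(lines) == len(target_syllables)
    per_line_ok = [
        sum(naive_syllables(w) for w in line.split()) == tgt
        for line, tgt in zip(lines, target_syllables)
    ]
    syllable_acc = sum(per_line_ok) / max(1, len(per_line_ok))
    return line_count_ok, syllable_acc
```

A checker like this corresponds to the evaluation side (number-of-line and syllable-per-line accuracy); the paper's contribution is making such format requirements part of the training signal rather than a post-hoc filter.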
Longshen Ou
National University of Singapore
Music Information Retrieval · Audio Processing · Natural Language Processing
Xichu Ma
School of Computing, National University of Singapore, 21 Lower Kent Ridge Road, Singapore
Ye Wang
School of Computing, National University of Singapore, 21 Lower Kent Ridge Road, Singapore