🤖 AI Summary
Existing symbolic music generation research primarily addresses isolated subtasks—such as lyric generation or melody transformation—and lacks end-to-end frameworks that jointly model lyrics and melody. This paper proposes the first instruction-driven lyric-melody co-generation model. The method introduces a word-level aligned tuple representation, initializes a note tokenizer guided by musical knowledge, and models melody structure in three hierarchical stages: motif → phrase → section. It employs a music-specialized large language model, an expanded note vocabulary, and rhythm-aware scalar initialization. Evaluated on lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song synthesis, the model consistently outperforms GPT-4. To foster reproducibility and further research, the authors publicly release SongCompose, a bilingual (Chinese–English) dataset of paired lyrics and melodies.
📝 Abstract
Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between the two. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody, and melody-to-lyric generation, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized large language model (LLM) that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations: 1) a flexible tuple format for word-level alignment of lyrics and melodies, 2) an extended tokenizer vocabulary for song notes, with scalar initialization based on musical knowledge to capture rhythm, and 3) a multi-stage pipeline that captures musical structure, starting with motif-level melody patterns and progressing to phrase-level structure for improved coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. Moreover, we will release SongCompose, a large-scale training dataset containing paired lyrics and melodies in Chinese and English.
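The word-level tuple alignment described above can be pictured with a small sketch. The abstract does not specify the exact tuple fields or serialization, so the layout below (word, pitch, duration, trailing rest) and the `to_prompt_line` helper are purely illustrative assumptions, not the paper's actual format:

```python
# Hypothetical sketch of a word-level lyric-melody tuple representation.
# Each entry pairs one lyric token with a note name, a duration, and a
# trailing rest (both in beats). Field names and serialization are
# assumptions for illustration only.

from typing import NamedTuple


class LyricNote(NamedTuple):
    word: str        # lyric token (a word or syllable)
    pitch: str       # note name, e.g. "C4"
    duration: float  # note length in beats
    rest: float      # rest after the note, in beats


def to_prompt_line(entries):
    """Serialize aligned tuples into a flat token sequence an LLM could consume."""
    return " | ".join(
        f"{e.word} <{e.pitch}> <{e.duration}> <{e.rest}>" for e in entries
    )


phrase = [
    LyricNote("twin", "C4", 0.5, 0.0),
    LyricNote("kle", "C4", 0.5, 0.0),
    LyricNote("star", "G4", 1.0, 0.5),
]
print(to_prompt_line(phrase))
```

Keeping lyrics and notes in one interleaved sequence like this is what lets a single decoder-only LLM attend to both modalities at every step, rather than coordinating two separate models.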