🤖 AI Summary
This work addresses the scarcity of aligned data and the unreliability of automatic annotations in text-to-score generation by proposing a two-stage framework. First, a large language model translates a textual prompt into a structured, measure-level musical plan; then, a conditional generative model synthesizes an ABC notation score that adheres to this plan. The approach derives supervision signals directly from symbolic MusicXML data, bypassing noisy or sparse text-music pairings, and introduces a multi-dimensional, score-oriented evaluation protocol validated by expert musicians. Experiments show that the proposed method consistently outperforms a pure LLM-based agentic framework and three end-to-end baselines on playability, readability, instrument utilization, structural complexity, and prompt adherence.
📝 Abstract
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations remain largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signature, and harmony. In the execution stage, a generative model consumes this plan and produces interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set, and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
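To make the plan-then-execute split more concrete, here is a minimal Python sketch of what a measure-wise plan and an interleaved ABC skeleton might look like. The abstract does not specify the plan schema or the exact ABC layout, so everything here is an assumption made for illustration: the `MeasurePlan` fields, the `plan_to_abc_skeleton` helper, and the choice to write each voice inline as `[V:n]` one measure at a time. The execution stage is replaced by full-measure rests; in Text2Score a trained generative model would produce the actual notes.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MeasurePlan:
    """Hypothetical per-measure plan entry (not the paper's actual schema)."""
    key: str             # e.g. "C" or "Am"
    time_signature: str  # e.g. "4/4"
    chord: str           # harmony label for the measure, e.g. "G7"


def full_measure_rest(time_signature: str) -> str:
    """Return an ABC rest that fills one measure, assuming unit length L:1/8."""
    numerator, denominator = (int(part) for part in time_signature.split("/"))
    eighths = Fraction(numerator, denominator) / Fraction(1, 8)
    if eighths.denominator != 1:
        raise ValueError(f"meter {time_signature} is not a whole number of eighths")
    return f"z{eighths.numerator}"


def plan_to_abc_skeleton(title: str, instruments: list[str],
                         plan: list[MeasurePlan]) -> str:
    """Render a plan as an ABC skeleton with one line per measure.

    Simplification: key and meter are taken from the first measure only,
    so mid-piece key or meter changes are not emitted.
    """
    first = plan[0]
    lines = ["X:1", f"T:{title}", f"M:{first.time_signature}", "L:1/8"]
    for index, name in enumerate(instruments, start=1):
        lines.append(f'V:{index} name="{name}"')
    lines.append(f"K:{first.key}")  # K: closes the ABC header

    for measure in plan:
        rest = full_measure_rest(measure.time_signature)
        voices = []
        for index in range(1, len(instruments) + 1):
            # Attach the chord symbol to the first voice only.
            chord = f'"{measure.chord}"' if index == 1 else ""
            voices.append(f"[V:{index}]{chord}{rest}|")
        lines.append("".join(voices))
    return "\n".join(lines)


example_plan = [
    MeasurePlan(key="C", time_signature="4/4", chord="C"),
    MeasurePlan(key="C", time_signature="4/4", chord="G7"),
]
print(plan_to_abc_skeleton("Sketch", ["Flute", "Cello"], example_plan))
```

Keeping the plan as explicit per-measure records means the structural constraints (instrumentation, key, meter, harmony) can be checked against the generated score independently of the model that fills in the notes.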