Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

career value

193K/year

🤖 AI Summary

Controllable symbolic music generation is hindered by the scarcity of large-scale, high-quality datasets annotated with rich metadata (e.g., instrumentation, style, composer) and titles. Method: We introduce MetaScore—the first 963K-sample symbolic music dataset with fine-grained annotations—and propose a novel LLM-augmented data construction paradigm: leveraging large language models to generate pseudo-natural-language descriptions for scores, integrated with REMI/ABC representations and metadata-driven alignment. Based on MetaScore, we train dual-path conditional generative models—diffusion and Transformer—driven jointly by text prompts and predefined categorical labels. Contribution/Results: Our approach achieves statistically significant improvements over baselines in subjective listening evaluations. The text interface enables open-domain natural-language control, while the label-based system supports high-precision structured generation; both attain production-level quality. MetaScore and the proposed framework jointly address the longstanding gaps in high-quality annotated symbolic music data and controllable generation infrastructure.

Technology Category

Application Category

📝 Abstract

Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we leverage a pretrained large language model (LLM) to generate pseudo natural language captions from the metadata. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test, while the text-based system offers a more natural interface that allows free-form natural language prompts.

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale symbolic music dataset with metadata

Need for text-to-symbolic music generation model

Enhancing control over music attributes via natural language

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates pseudo captions from metadata

Text-conditioned model creates symbolic music

Tag-conditioned system supports predefined tags

🔎 Similar Papers

No similar papers found.