🤖 AI Summary
Controllable symbolic music generation has lagged behind audio-domain text-to-music generation, partly because no large-scale symbolic music dataset offers extensive metadata (e.g., instruments, genre, composer) and captions.
Method: We introduce MetaScore, a new dataset of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. A pretrained large language model (LLM) generates pseudo natural-language captions from the metadata. On the LLM-enhanced MetaScore, we train two conditional generation models: a text-conditioned model that generates symbolic music from the pseudo captions, and a tag-conditioned model that supports a predefined set of tags available in MetaScore.
Contribution/Results: In a listening test, both the text-to-music and tags-to-music models outperform a baseline text-to-music model. The text-based system accepts free-form natural-language prompts and allows control of instruments, genre, composer, complexity, and other music descriptors, while the tag-based system works with the predefined tag set. MetaScore addresses the lack of a large-scale symbolic music dataset with extensive metadata and captions.
📝 Abstract
Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we leverage a pretrained large language model (LLM) to generate pseudo natural language captions from the metadata. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test, while the text-based system offers a more natural interface that allows free-form natural language prompts.
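The captioning step described above (metadata in, pseudo caption out) can be illustrated with a minimal sketch of the prompt-building half. This is not the authors' pipeline: the function name `build_caption_prompt`, the metadata field names, and the prompt wording are all hypothetical, and the actual LLM call is left out.

```python
def build_caption_prompt(metadata: dict) -> str:
    """Format a score's metadata as an instruction an LLM could turn into a caption.

    Hypothetical helper: the field names and prompt wording are assumptions,
    not taken from the MetaScore paper.
    """
    lines = []
    for key in ("genre", "composer", "complexity", "instruments", "tags"):
        value = metadata.get(key)
        # Skip missing or empty fields so the LLM is not asked to describe absent data.
        if not value:
            continue
        # Join multi-valued fields (e.g. several instruments) into one readable string.
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{key}: {value}")
    header = ("Write a one-sentence natural-language description "
              "of a musical score with these attributes:")
    return header + "\n" + "\n".join(lines)


# Made-up example record; real MetaScore entries may use different fields.
example = {
    "genre": "classical",
    "composer": "",
    "complexity": "intermediate",
    "instruments": ["piano", "violin"],
    "tags": ["romantic", "slow"],
}
prompt = build_caption_prompt(example)
print(prompt)
```

The returned string would then be sent to whichever LLM is used for captioning. Dropping empty fields is one plausible way to handle the uneven coverage of user-annotated metadata.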