🤖 AI Summary
This work addresses the challenge that symbolic music foundation models struggle to jointly model absolute and relative musical attributes. We propose a music-knowledge-driven dual-attribute Transformer architecture pretrained on 81.6K hours of MIDI data. It introduces three key innovations: (1) a domain-knowledge-inspired dual-attribute tokenization scheme integrating pitch, duration, and other musical dimensions; (2) a parameter-free Multidimensional Relative Attention (MRA) mechanism that explicitly captures intervallic, rhythmic, and harmonic relationships; and (3) finetuning frameworks with full anticipatory capabilities supporting both music understanding and conditional generation, including music infilling. On three classification tasks across four benchmark datasets, our model achieves superior accuracy and F1 scores compared with state-of-the-art large pretrained models in most cases. It also outperforms a strong Transformer baseline in conditional generation. The code, pretrained model, and generated samples are fully open-sourced.
📝 Abstract
Moonbeam is a Transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative musical information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose two finetuning architectures with full anticipatory capabilities, targeting two categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across three downstream music classification tasks on four datasets. Moreover, our finetuned conditional music generation model outperforms a strong Transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on GitHub.
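To make the idea of parameter-free relative attention concrete, here is a minimal NumPy sketch of the general pattern: each token carries a vector of absolute musical attributes (e.g., pitch and onset time), pairwise signed differences across those dimensions are computed, and a bias derived from them is added to the attention logits without introducing any trainable parameters. This is an illustrative simplification under our own assumptions, not Moonbeam's exact MRA formulation; the function names, the choice of a summed absolute-difference bias, and the attribute layout are all hypothetical.

```python
import numpy as np

def multidim_relative_bias(attrs):
    """Parameter-free pairwise bias from absolute attributes.

    attrs: (seq_len, n_dims) array of absolute attributes per token,
           e.g., columns for MIDI pitch and onset time (hypothetical layout).
    Returns a (seq_len, seq_len) bias: pairs of tokens that are close in
    every attribute dimension receive a smaller penalty.
    """
    # Signed per-dimension differences: (seq_len, seq_len, n_dims).
    diff = attrs[:, None, :] - attrs[None, :, :]
    # Collapse to one scalar per pair; no learnable weights involved.
    return -np.abs(diff).sum(axis=-1)

def attention_with_relative_bias(q, k, v, attrs):
    """Scaled dot-product attention with the relative bias added to logits."""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d) + multidim_relative_bias(attrs)
    # Numerically stable softmax over keys.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias is a fixed function of the token attributes, the mechanism adds relative information (intervals between pitches, gaps between onsets) at zero parameter cost, in the same spirit as the parameter-free MRA described above.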