🤖 AI Summary
This work addresses the challenge that symbolic music foundation models struggle to jointly model absolute and relative musical attributes. We propose a music-knowledge-driven dual-attribute Transformer architecture pretrained on 81.6K hours of MIDI data. It introduces three key innovations: (1) a domain-knowledge-inspired dual-attribute tokenization scheme integrating pitch, duration, and other musical dimensions; (2) a parameter-free Multidimensional Relative Attention (MRA) mechanism that explicitly captures intervallic, rhythmic, and harmonic relationships; and (3) finetuning frameworks with full anticipatory capabilities supporting both music understanding and conditional generation, including music infilling. On three classification tasks across four benchmark datasets, our model achieves superior accuracy and F1 scores compared with state-of-the-art large pretrained models in most cases. It also outperforms a strong Transformer baseline in conditional generation. The code, pretrained model, and generated samples are fully open-sourced.
📝 Abstract
Moonbeam is a Transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative musical information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose two finetuning architectures with full anticipatory capabilities, targeting two categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across three downstream music classification tasks on four datasets. Moreover, our finetuned conditional music generation model outperforms a strong Transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on GitHub.
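To make the idea of parameter-free relative attention concrete, here is a minimal NumPy sketch of the general pattern: each token carries a vector of absolute musical attributes (e.g., pitch and onset time), pairwise signed differences across those dimensions are computed, and a bias derived from them is added to the attention logits without introducing any trainable parameters. This is an illustrative simplification under our own assumptions, not Moonbeam's exact MRA formulation; the function names, the choice of a summed absolute-difference bias, and the attribute layout are all hypothetical.

```python
import numpy as np

def multidim_relative_bias(attrs):
    """Parameter-free pairwise bias from absolute attributes.

    attrs: (seq_len, n_dims) array of absolute attributes per token,
           e.g., columns for MIDI pitch and onset time (hypothetical layout).
    Returns a (seq_len, seq_len) bias: pairs of tokens that are close in
    every attribute dimension receive a smaller penalty.
    """
    # Signed per-dimension differences: (seq_len, seq_len, n_dims).
    diff = attrs[:, None, :] - attrs[None, :, :]
    # Collapse to one scalar per pair; no learnable weights involved.
    return -np.abs(diff).sum(axis=-1)

def attention_with_relative_bias(q, k, v, attrs):
    """Scaled dot-product attention with the relative bias added to logits."""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d) + multidim_relative_bias(attrs)
    # Numerically stable softmax over keys.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias is a fixed function of the token attributes, the mechanism adds relative information (intervals between pitches, gaps between onsets) at zero parameter cost, in the same spirit as the parameter-free MRA described above.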