OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction

📅 2025-07-04

🤖 AI Summary

To address the lack of general-purpose, open-source foundation models for music audio understanding, this paper introduces OMAR-RQ, a model trained with multi-feature masked token prediction. The model is pretrained with self-supervised masked token classification on over 330,000 hours of music audio, and the authors experiment with different input features and quantization options. Across six tasks (music tagging, pitch estimation, chord recognition, beat tracking, structural segmentation, and difficulty estimation), it achieves state-of-the-art performance among open self-supervised models. The training and evaluation pipelines and the model weights are open-sourced, providing a reproducible foundation for music information retrieval and understanding research.

📝 Abstract

Developing open-source foundation models is essential for advancing research in music audio understanding and ensuring access to powerful, multipurpose representations for music information retrieval. We present OMAR-RQ, a model trained with self-supervision via masked token classification methodologies using a large-scale dataset with over 330,000 hours of music audio. We experiment with different input features and quantization options, and achieve state-of-the-art performance in music tagging, pitch estimation, chord recognition, beat tracking, segmentation, and difficulty estimation among open self-supervised models. We open-source our training and evaluation pipelines and model weights, available at https://github.com/mtg/omar-rq.
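
The abstract describes masked token classification only at a high level. The sketch below illustrates the general idea under stated assumptions and is not the authors' implementation: a frozen random-projection quantizer (one possible quantization option, chosen here for illustration) turns each feature frame into a discrete token, and a cross-entropy loss is computed only at masked positions. All function names, dimensions, and the masking probability are hypothetical.

```python
import math
import random


def random_projection_quantizer(dim, proj_dim, codebook_size, seed=0):
    """Build a frozen random projection and codebook; return a tokenizer.

    Assumption: this quantizer is an illustrative choice, not confirmed
    by the abstract.
    """
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(proj_dim)]
    codebook = [[rng.gauss(0, 1) for _ in range(proj_dim)]
                for _ in range(codebook_size)]

    def tokenize(frame):
        # Project the frame, then pick the nearest codebook entry
        # by squared Euclidean distance.
        z = [sum(w * x for w, x in zip(row, frame)) for row in proj]
        dists = [sum((a - b) ** 2 for a, b in zip(z, code))
                 for code in codebook]
        return dists.index(min(dists))

    return tokenize


def sample_mask(num_frames, mask_prob, rng):
    """Mark each frame as masked independently with probability mask_prob."""
    return [rng.random() < mask_prob for _ in range(num_frames)]


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only (0.0 if none are masked)."""
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp.
        m = max(frame_logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical input: 20 frames of an 8-dimensional audio feature.
    frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
    tokenize = random_projection_quantizer(dim=8, proj_dim=4, codebook_size=16)
    targets = [tokenize(f) for f in frames]
    mask = sample_mask(len(frames), mask_prob=0.5, rng=rng)
    # Stand-in for encoder predictions: uniform logits over the codebook.
    logits = [[0.0] * 16 for _ in frames]
    print(masked_cross_entropy(logits, targets, mask))
```

In a real training setup the uniform logits would be replaced by the output of an encoder that sees the masked input. A multi-feature variant could, for example, run one tokenizer per input feature and average the per-feature losses, although the abstract does not specify how the features are combined.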

Problem

Research questions and friction points this paper is trying to address.

Develop an open-source music audio representation model

Improve performance across multiple music information retrieval tasks

Provide a large-scale self-supervised training methodology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised masked token classification training

Multi-feature input and quantization experiments

Open-source model with training pipelines
