🤖 AI Summary
Existing self-supervised music foundation models exhibit limited performance on pitch-sensitive key detection. This work systematically investigates how pretraining design influences pitch sensitivity and proposes masked contrastive pretraining on Mel spectrograms; the resulting representations are first assessed via linear evaluation and then probed with a shallow but wide MLP. The study demonstrates for the first time that masked contrastive embeddings substantially improve key detection accuracy, achieving state-of-the-art results without relying on complex data augmentation. Moreover, the learned representations are inherently robust to common audio transformations, validating self-supervised pretraining as an effective approach for pitch-sensitive music information retrieval.
📝 Abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms yields competitive performance on music key detection out of the box. Building on this, we train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, reaching SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
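The probing recipe the abstract describes — freeze the pretrained encoder, extract embeddings, then fit a shallow but wide MLP classifier on top — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the "frozen embeddings" are synthetic class-dependent vectors standing in for real model features, and the 24 classes mirror the standard 24 major/minor keys; layer widths, learning rate, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen embeddings from a pretrained model: one cluster
# center per key class (24 major/minor keys), plus per-example noise.
n_classes, dim, n_per_class = 24, 64, 30
centers = rng.normal(size=(n_classes, dim))
X = np.concatenate([c + 0.3 * rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shallow but wide MLP probe: a single hidden ReLU layer wider than the
# embedding dimension, followed by a softmax classifier head.
hidden = 256
W1 = rng.normal(scale=0.1, size=(dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    """Return hidden activations and class logits."""
    h = np.maximum(X @ W1 + b1, 0.0)
    return h, h @ W2 + b2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Full-batch gradient descent on softmax cross-entropy; the encoder
# (here, the synthetic X) stays frozen throughout.
lr = 0.5
for step in range(300):
    h, logits = forward(X)
    g = softmax(logits)
    g[np.arange(len(y)), y] -= 1.0   # d(cross-entropy)/d(logits)
    g /= len(y)
    dh = g @ W2.T
    dh[h <= 0] = 0.0                 # ReLU gradient mask
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0)

_, logits = forward(X)
print("train accuracy:", (logits.argmax(axis=1) == y).mean())
```

On separable embeddings like these, the probe's train accuracy rises well above chance (1/24) within a few hundred steps; in the paper's setting, the analogous quantity is downstream key detection accuracy on held-out data.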