Masked Contrastive Pre-Training Improves Music Audio Key Detection

📅 2026-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised music foundation models exhibit limited performance on pitch-sensitive key detection. This work systematically investigates how pretraining design influences pitch sensitivity and proposes a masked contrastive pretraining approach on Mel spectrograms, followed by linear evaluation with a shallow, wide MLP trained on the learned representations. It demonstrates for the first time that masked contrastive embeddings substantially improve key detection accuracy, achieving state-of-the-art results without complex data augmentation. Moreover, the learned representations are inherently robust to common audio transformations, validating self-supervised pretraining for pitch-sensitive music information retrieval tasks.
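The masked contrastive objective summarized above can be sketched in a few lines. This is a minimal NumPy illustration under assumed details, not the paper's implementation: the encoder is a random linear map, the masking policy (zeroing a fraction of time frames), the masking ratio, and the temperature are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mask_time_frames(mel, frac=0.5):
    """Zero out a random fraction of time frames (a simple masking policy)."""
    m = mel.copy()
    t = mel.shape[-1]
    idx = rng.choice(t, size=int(frac * t), replace=False)
    m[..., idx] = 0.0
    return m

def info_nce(z_masked, z_clean, temperature=0.1):
    """Contrastive loss: each masked clip should match its own clean clip,
    with the other clips in the batch serving as in-batch negatives."""
    za = l2_normalize(z_masked)
    zb = l2_normalize(z_clean)
    logits = za @ zb.T / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

# Toy batch: 8 clips of 64 mel bins x 100 frames; random linear "encoder".
B, n_mels, n_frames, d = 8, 64, 100, 32
mels = rng.standard_normal((B, n_mels, n_frames))
W = rng.standard_normal((n_mels * n_frames, d)) / np.sqrt(n_mels * n_frames)
encode = lambda x: x.reshape(B, -1) @ W

loss = info_nce(encode(np.stack([mask_time_frames(m) for m in mels])),
                encode(mels))
print(f"InfoNCE loss: {loss:.3f}")
```

With a random untrained encoder the loss sits near log(B); pretraining drives it down by making masked and clean views of the same clip agree.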

📝 Abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
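The "shallow but wide MLP" probe from the abstract amounts to one wide hidden layer trained on frozen features. A minimal NumPy sketch follows; the feature dimension, hidden width, synthetic data (labels generated from a hypothetical linear map of the features), and learning rate are all assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen embeddings from a pretrained model (random stand-ins here) with
# 24-way key labels (12 tonics x major/minor), made linearly decodable.
n, d, hidden, n_keys = 256, 32, 512, 24
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal((d, n_keys))).argmax(axis=1)

# Shallow but wide MLP probe: one hidden ReLU layer on frozen features.
W1 = rng.standard_normal((d, hidden)) * np.sqrt(2.0 / d)
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, n_keys)) * np.sqrt(2.0 / hidden)
b2 = np.zeros(n_keys)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)
    return h, h @ W2 + b2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(200):                  # plain full-batch gradient descent
    h, logits = forward(X)
    g = softmax(logits)
    g[np.arange(n), y] -= 1.0            # d(mean cross-entropy)/d(logits)
    g /= n
    gW2, gb2 = h.T @ g, g.sum(0)
    dh = (g @ W2.T) * (h > 0)            # backprop through the ReLU
    gW1, gb1 = X.T @ dh, dh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

acc = float((forward(X)[1].argmax(1) == y).mean())
print(f"train accuracy: {acc:.2f}")
```

The point of the probe is that the pretrained features, not the classifier, do the heavy lifting: the head is cheap to train and has no convolutional or recurrent structure of its own.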
Problem

Research questions and friction points this paper is trying to address.

key detection
pitch sensitivity
self-supervised pretraining
music audio
music information retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked contrastive learning
pitch sensitivity
key detection
self-supervised pretraining
music foundation models