🤖 AI Summary
Existing self-supervised music foundation models exhibit limited performance on pitch-sensitive key detection. This work systematically investigates how pretraining design influences pitch sensitivity and proposes masked contrastive pretraining on Mel spectrograms; the resulting representations are first assessed via linear evaluation and then probed with a shallow but wide MLP. The study demonstrates for the first time that masked contrastive embeddings substantially improve key detection accuracy, achieving state-of-the-art results without relying on complex data augmentation. Moreover, the learned representations are inherently robust to common audio transformations, validating self-supervised pretraining as an effective approach for pitch-sensitive music information retrieval.
📝 Abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms yields competitive performance on music key detection out of the box. Building on this, we train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, reaching SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
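The probing recipe the abstract describes — freeze the pretrained encoder, extract embeddings, then fit a shallow but wide MLP classifier on top — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the "frozen embeddings" are synthetic class-dependent vectors standing in for real model features, and the 24 classes mirror the standard 24 major/minor keys; layer widths, learning rate, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen embeddings from a pretrained model: one cluster
# center per key class (24 major/minor keys), plus per-example noise.
n_classes, dim, n_per_class = 24, 64, 30
centers = rng.normal(size=(n_classes, dim))
X = np.concatenate([c + 0.3 * rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shallow but wide MLP probe: a single hidden ReLU layer wider than the
# embedding dimension, followed by a softmax classifier head.
hidden = 256
W1 = rng.normal(scale=0.1, size=(dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    """Return hidden activations and class logits."""
    h = np.maximum(X @ W1 + b1, 0.0)
    return h, h @ W2 + b2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Full-batch gradient descent on softmax cross-entropy; the encoder
# (here, the synthetic X) stays frozen throughout.
lr = 0.5
for step in range(300):
    h, logits = forward(X)
    g = softmax(logits)
    g[np.arange(len(y)), y] -= 1.0   # d(cross-entropy)/d(logits)
    g /= len(y)
    dh = g @ W2.T
    dh[h <= 0] = 0.0                 # ReLU gradient mask
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0)

_, logits = forward(X)
print("train accuracy:", (logits.argmax(axis=1) == y).mean())
```

On separable embeddings like these, the probe's train accuracy rises well above chance (1/24) within a few hundred steps; in the paper's setting, the analogous quantity is downstream key detection accuracy on held-out data.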