🤖 AI Summary
In contrastive learning, aggressive data augmentation often erodes equivariant properties critical to music, such as tonality and rhythmic structure, thereby degrading downstream performance (e.g., query-by-humming). To address this, we propose Leave One EquiVariant (LOEV), the first framework in self-supervised music representation learning to explicitly model *selective equivariance*: it preserves invariance under most augmentations while enforcing equivariant responses to music-specific transformations (e.g., pitch shift, time stretch). We further extend LOEV to LOEV++, enabling augmentation-attribute-driven disentangled representation learning and targeted retrieval. Experiments demonstrate that LOEV and LOEV++ mitigate the information loss induced by overly strong invariance, yielding substantial gains on augmentation-sensitive tasks and cross-modal music retrieval while retaining competitive general-purpose representation quality.
📝 Abstract
Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation, and the invariances learnt as a result, pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave One EquiVariant (LOEV) framework which, in contrast to previous work, offers a flexible, task-adaptive approach: by selectively preserving information about specific augmentations, it allows the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation-related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner, and enables targeted retrieval based on augmentation-related attributes.
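The core "leave one equivariant" idea can be sketched as a view-generation step: every augmentation in the chain except the left-out one is applied as usual (inducing invariance), while the left-out transformation is applied with its parameter recorded, so an equivariant head can later be trained against that parameter. The sketch below is a minimal illustration of that mechanism, not the paper's implementation; the augmentation names, the toy `toy_augment` transform, and the parameter ranges are all assumptions for demonstration.

```python
import random

# Hypothetical augmentation chain; the paper's actual chain may differ.
AUGMENTATIONS = ["gain", "noise", "reverb", "pitch_shift", "time_stretch"]

def make_loev_views(x, augment, left_out="pitch_shift", rng=random):
    """Generate two contrastive views of signal `x`.

    All augmentations except `left_out` induce invariance as in standard
    contrastive chains. The left-out augmentation is still applied, but its
    sampled parameter is returned alongside each view, so the model can be
    trained to respond equivariantly to it (a sketch of the LOEV idea).
    """
    views, params = [], []
    for _ in range(2):
        v = x
        for name in AUGMENTATIONS:
            if name == left_out:
                continue                      # skipped in the invariance chain
            v = augment(v, name, rng.uniform(0.0, 1.0))
        p = rng.uniform(-1.0, 1.0)            # e.g. normalized semitone shift
        v = augment(v, left_out, p)           # applied, with parameter kept
        views.append(v)
        params.append(p)
    return views, params

def toy_augment(v, name, p):
    # Stand-in transform for demonstration: shift every sample by p.
    return [s + p for s in v]
```

A contrastive loss would then pull the two views together as usual, while an auxiliary head regresses (or classifies) the retained `params`, keeping the representation sensitive to the left-out attribute.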