Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Self-supervised learning (SSL) in music information retrieval (MIR) faces a fundamental trade-off: contrastive learning excels at tagging tasks (e.g., instrument recognition) but underperforms on structured prediction (e.g., tonality estimation), whereas equivariant learning matches supervised methods on its target task but generalizes poorly to others. To address this, we propose a multi-class-token multitask (MT2) SSL framework built upon a Vision Transformer with 1-D spectrogram patches (ViT-1D). Our method introduces a dual-class-token mechanism that jointly optimizes an equivariant objective, cross-power spectral density (CPSD) over the circle of fifths, and a contrastive objective, NT-Xent, end-to-end in a single model. This design enables complementary representation learning, capturing both local discriminability and global structural regularities. Evaluated with a single linear probe on last-layer features, MT2 consistently outperforms task-specific single-class-token ViT-1D baselines across diverse MIR benchmarks, and outperforms MERT on all tasks except beat tracking while using roughly 18x fewer parameters, demonstrating strong generalization and holistic performance.

📝 Abstract
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); the latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the qualification of self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivariant learning over the circle of fifths, while the latter optimizes normalized temperature-scaled cross-entropy (NT-Xent) for contrastive learning. MT2 combines the strengths of both pretext tasks and consistently outperforms both single-class-token ViT-1D models trained with either contrastive or equivariant learning. Averaging the two class tokens further improves performance on several tasks, highlighting the complementary nature of the representations learned by each class token. Furthermore, using the same single-linear-layer probing method on the features of the last layer, MT2 outperforms MERT on all tasks except beat tracking, achieving this with 18x fewer parameters thanks to its multitasking capabilities. Our SSL benchmark demonstrates the versatility of our multi-class-token multitask learning approach for MIR applications.
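The contrastive branch described above optimizes NT-Xent (normalized temperature-scaled cross-entropy). As a minimal illustration of that objective, here is a pure-Python sketch for a batch of paired embeddings; the batching, temperature value, and vector shapes are assumptions for the example, not the paper's settings:

```python
import math

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent loss sketch. z1, z2: lists of embedding vectors, where
    (z1[i], z2[i]) is a positive pair; all other pairs are negatives."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def cos(a, b):  # cosine similarity of already-normalized vectors
        return sum(x * y for x, y in zip(a, b))

    z = [norm(v) for v in z1 + z2]
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of i's positive counterpart
        pos = math.exp(cos(z[i], z[j]) / tau)
        denom = sum(math.exp(cos(z[i], z[k]) / tau)
                    for k in range(2 * n) if k != i)
        total += -math.log(pos / denom)
    return total / (2 * n)
```

When the two views of each item embed identically, the loss is near zero; when positives are misaligned, it grows, which is the gradient signal the contrastive class token is trained on.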
Problem

Research questions and friction points this paper is trying to address.

Contrastive SSL is effective for tagging (e.g., instrument recognition) but weak on structured prediction (e.g., tonality estimation)
Equivariant SSL can match supervised methods on its target task but generalizes poorly to other tasks
Large pretrained models such as MERT achieve broad task coverage only at a high parameter cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViT-1D backbone equipped with two specialized class tokens
Joint training on contrastive (NT-Xent) and equivariant (CPSD over the circle of fifths) pretext tasks
Single multitask SSL model that outperforms larger task-specific baselines with 18x fewer parameters than MERT
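Structurally, the dual-class-token idea amounts to prepending two learnable tokens to the 1-D spectrogram patch sequence and reading each one out for its own pretext loss after the shared transformer stack. The sketch below shows only that routing; the function and argument names are assumptions for illustration, and the transformer is passed in as an opaque callable:

```python
def mt2_forward(patch_embeddings, cls_equiv, cls_contra, transformer):
    """Sketch of the MT2 token routing (names are assumptions).
    Both class tokens pass through the SAME transformer; each then
    feeds its own head: cls_equiv -> CPSD loss, cls_contra -> NT-Xent."""
    seq = [cls_equiv, cls_contra] + patch_embeddings  # prepend class tokens
    out = transformer(seq)                            # shared backbone
    z_equiv, z_contra = out[0], out[1]                # per-token readouts
    return z_equiv, z_contra
```

Because the backbone is shared, gradients from both pretext losses shape the same patch representations, while each class token specializes, which is what the abstract credits for the complementary features and the gain from averaging the two tokens.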