STONE: Self-supervised Tonality Estimator

📅 2024-07-10
🏛️ International Society for Music Information Retrieval Conference
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing key estimation methods rely on large-scale annotated datasets, incurring prohibitive labeling costs. This paper introduces STONE, the first fully self-supervised key estimation framework. STONE first uses the cross-power spectral density (CPSD) between unlabeled excerpts of the same track, measured on the circle of fifths, to self-supervise the learning of a 12-dimensional key signature distribution; it then extends the output to 24 dimensions and adds lightweight supervision to disambiguate major and minor modes. Key contributions: (i) the first end-to-end self-supervised key estimation pipeline; (ii) empirical evidence that CPSD pretraining induces tonal structure without labels; and (iii) a semi-supervised variant, Semi-TONE, that matches fully supervised performance on the FMAK dataset (5,489 real-world recordings) with only 10% of the labels and outperforms fully supervised baselines under an equal labeling budget. The model uses ChromaNet, a chroma-based convnet with octave equivalence, and outputs structured logits aligned to the circle of fifths.
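The CPSD pretext task exploits a simple fact: a transposition by k semitones is a circular shift of 7k mod 12 bins on the circle of fifths, and such a shift shows up as a phase rotation in the cross-power spectrum of two 12-bin profiles. A minimal sketch of this idea (the function `cof_shift` and its exact layout are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cof_shift(ksp_a, ksp_b):
    """Estimate the circular shift between two 12-bin key signature
    profiles laid out along the circle of fifths, via the phase of
    their cross-power spectral density at the first DFT bin.

    If ksp_b is ksp_a shifted by s bins (ksp_b[n] = ksp_a[n - s]),
    the phase of fa[1] * conj(fb[1]) equals 2*pi*s/12.
    """
    fa = np.fft.fft(ksp_a)
    fb = np.fft.fft(ksp_b)
    cpsd = fa[1] * np.conj(fb[1])          # cross-power spectrum, bin 1
    shift = np.angle(cpsd) * 12 / (2 * np.pi)
    return int(round(shift)) % 12
```

In STONE the analogous quantity is differentiable and serves as a training signal for regressing artificial pitch transpositions; the sketch above only shows why a shift on the circle of fifths is recoverable from CPSD phase.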

📝 Abstract
Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured by cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads KSP to correlate with tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators: i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5,489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision.
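The abstract's "octave equivalence" means that frequency bins one octave apart are treated as the same pitch class. The paper's ChromaNet realizes this inside the network; as a rough standalone illustration only (the helper `fold_to_chroma` is an assumption, not the paper's architecture), octave folding of a log-frequency representation can be sketched as:

```python
import numpy as np

def fold_to_chroma(logf_spec, bins_per_octave=12):
    """Fold a log-frequency spectrogram into 12 chroma bins by
    summing bins that are whole octaves apart (octave equivalence).

    logf_spec: array of shape (n_bins, n_frames), lowest bin first,
    with n_bins assumed to be a multiple of bins_per_octave.
    """
    n_bins, n_frames = logf_spec.shape
    n_octaves = n_bins // bins_per_octave
    folded = logf_spec[: n_octaves * bins_per_octave]
    # (n_octaves, 12, n_frames) -> sum over the octave axis
    return folded.reshape(n_octaves, bins_per_octave, n_frames).sum(axis=0)
```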
Problem

Research questions and friction points this paper is trying to address.

Self-supervised key estimation without massive annotation effort
Disambiguating major versus minor keys sharing key signatures
Evaluating accuracy of semi-supervised and fully supervised estimators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised key estimation via pitch transposition regression
ChromaNet, a convnet architecture with octave equivalence
Semi-supervised extension disambiguates major/minor keys
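The 24-logit output pairs each of the 12 key signatures with its major and minor mode. Under one plausible layout (the indexing below is an illustrative assumption, not the paper's specification), decoding such a structured profile into a key name could look like:

```python
import numpy as np

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def decode_key(logits24):
    """Map a 24-logit profile to a key name, assuming indices 0..11
    are key signatures ordered along the circle of fifths from C major
    and 12..23 are the corresponding relative minors.
    """
    idx = int(np.argmax(logits24))
    s, is_minor = idx % 12, idx >= 12
    tonic = (7 * s) % 12               # s fifths up from C: C, G, D, ...
    if is_minor:
        tonic = (tonic + 9) % 12       # relative minor, e.g. C major -> A minor
        return PITCHES[tonic] + " minor"
    return PITCHES[tonic] + " major"
```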