🤖 AI Summary
This work addresses automatic chord recognition, which is hindered by the scarcity and high cost of aligned labeled data. The authors propose a two-stage training framework: first, a pretrained Bi-directional Transformer for Chord recognition (BTC) model generates pseudo-labels for over a thousand hours of unlabeled audio, which are used to train a student model; then the student is continually trained on real labels, with selective knowledge distillation applied to mitigate catastrophic forgetting. Evaluated with standard mir_eval metrics, the BTC student outperforms the fully supervised baseline by 2.5% and surpasses the original teacher model by 1.55%. A student using the 2E1D architecture improves over the supervised baseline by 3.79%, nearly matching the teacher's accuracy, with particularly notable gains on rare chord types.
📝 Abstract
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, and the 2E1D student about 96%, across seven standard mir_eval metrics. After a single stage-2 training run, the BTC student surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher by 1.55% on average across all metrics, while the 2E1D student improves on the supervised baseline by 3.79% on average and achieves almost the same performance as the teacher. Both students show large gains on rare chord qualities.
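The second-stage objective can be sketched as ground-truth cross-entropy plus a selective KD term from the frozen teacher. The selection rule shown here (distill only on frames where the teacher is confident), the confidence threshold, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def stage2_loss(student_logits, teacher_logits, targets,
                kd_weight=0.5, temperature=2.0, conf_threshold=0.9):
    """Cross-entropy on ground-truth chords plus selective KD.

    Shapes (illustrative): logits are (batch, frames, n_chords);
    targets are (batch, frames) integer chord-class indices.
    The confidence-based frame mask is an assumed selection rule.
    """
    n_chords = student_logits.shape[-1]
    # Supervised term on the newly available ground-truth labels.
    ce = F.cross_entropy(student_logits.reshape(-1, n_chords),
                         targets.reshape(-1))

    # Teacher soft targets and student log-probs at temperature T.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # Selective mask: distill only where the teacher is confident,
    # so the regularizer preserves stage-1 knowledge without copying noise.
    mask = (t_probs.max(dim=-1).values > conf_threshold).float()

    frame_kl = F.kl_div(s_logp, t_probs, reduction='none').sum(dim=-1)
    kd = (frame_kl * mask).sum() / mask.sum().clamp(min=1.0)
    kd = kd * temperature ** 2  # standard KD temperature scaling

    return ce + kd_weight * kd
```

In a training loop this loss would be applied with the teacher in `eval()` mode and its logits computed under `torch.no_grad()`, so only the student receives gradients.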