Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatic chord recognition, which is hindered by the scarcity and high cost of aligned labeled data. The authors propose a two-stage training framework: first, a pretrained BTC (Bi-directional Transformer for Chord recognition) model generates pseudo-labels for over a thousand hours of unlabeled audio to train a lightweight student model; then, continual learning combines real labels with selective knowledge distillation to mitigate catastrophic forgetting. Evaluated with standard mir_eval metrics, the approach delivers significant gains: the BTC student outperforms the fully supervised baseline by 2.5% and surpasses the original teacher model by 1.55%. A 2E1D student improves on its supervised baseline by 3.79% and nearly matches the teacher's accuracy, with particularly notable gains on rare chord qualities.
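The stage-1 pseudo-labeling step summarized above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the confidence threshold and the ignore-label convention are assumptions added for clarity:

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generate_pseudo_labels(teacher_logits, min_confidence=0.5):
    """Turn frozen-teacher frame logits of shape (T, C) into hard
    pseudo-labels for student training.

    Frames whose top teacher probability falls below `min_confidence`
    are masked out (label -1) so the student is not trained on noisy
    targets. The threshold is an illustrative choice, not a value
    reported in the paper.
    """
    probs = softmax(teacher_logits)
    labels = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    labels[confidence < min_confidence] = -1  # ignore index
    return labels

# toy example: 4 frames, 3 chord classes
logits = np.array([[4.0, 0.1, 0.2],
                   [0.3, 3.5, 0.1],
                   [0.4, 0.5, 0.6],   # low-confidence frame
                   [0.2, 0.1, 5.0]])
print(generate_pseudo_labels(logits))  # → [ 0  1 -1  2]
```

In the paper's setting the teacher runs over more than 1,000 hours of unlabeled audio, and the student is trained only on the resulting pseudo-labels in stage 1.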

📝 Abstract
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves on the traditional supervised learning baseline by 3.79% on average and nearly matches the teacher's performance. Both students show large gains on rare chord qualities.
Problem

Research questions and friction points this paper is trying to address.

Automatic Chord Recognition
aligned chord labels
label scarcity
pseudo-labeling
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

pseudo-labeling
knowledge distillation
automatic chord recognition
pre-trained models
two-stage training
Nghia Phan, California State University, Fullerton
Rong Jin, California State University, Fullerton (Multimedia Systems, AR/VR, Applied ML, Big Data)
Gang Liu, Microsoft, Redmond, WA, USA
Xiao Dong, unknown affiliation (DM, CV, ML)