🤖 AI Summary
Deep automatic music transcription (AMT) models achieve strong performance on in-distribution datasets, but their generalization across musical genres, dynamic ranges, and levels of polyphonic complexity remains poorly understood. Method: We introduce the Multi-Axis Distribution Shift (MDS) corpus, the first benchmark explicitly designed to evaluate AMT robustness along multiple musically meaningful axes, and propose a dual-track evaluation framework that integrates information-retrieval principles with music-aware metrics, including measures of harmonic structure that address limitations of conventional F1 scores. Results: Experiments reveal that dynamic variation degrades dynamics estimation significantly more than onset detection; sound-related distribution shift reduces note-level F1 by 20 percentage points and genre shift by 14; and models exhibit severe robustness degradation under extreme, non-musical distribution shifts. Our core contribution is the systematic quantification of music-dimensional biases in AMT and the establishment of the first generalization-assessment framework grounded in musical structure.
📝 Abstract
Automatic Music Transcription (AMT), the task of converting music audio into note representations, has seen rapid progress, driven largely by deep learning systems. Because richly annotated music datasets are scarce, much of this progress has been concentrated on classical piano music, and indeed on a handful of very specific datasets. Whether these systems generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension: variations in genre, dynamics, and polyphony level. To this end, we introduce the MDS corpus, comprising three distinct subsets: (1) Genre, (2) Random, and (3) MAEtest, which emulate different axes of distribution shift. We evaluate several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval metrics and musically informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 drop of 20 percentage points due to sound and 14 points due to genre. Generally, we find that dynamics estimation is more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
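For concreteness, note-level F1 in AMT is conventionally computed by matching estimated notes to reference notes under onset, pitch, and (optionally) offset tolerances, as implemented in the mir_eval library. The sketch below illustrates both the onset-only and onset+offset variants on made-up note lists; whether the paper used mir_eval or these exact default tolerances is an assumption, not something stated in the abstract.

```python
# Minimal sketch of note-level F1 evaluation as conventionally done in AMT,
# using mir_eval. The reference/estimate notes are illustrative only.
import numpy as np
import mir_eval

# Notes are (onset, offset) intervals in seconds plus pitches in Hz
# (mir_eval.transcription expects frequencies, not MIDI note numbers).
ref_intervals = np.array([[0.00, 0.50], [0.50, 1.00], [1.00, 1.50]])
ref_pitches = np.array([440.00, 493.88, 523.25])  # A4, B4, C5
est_intervals = np.array([[0.02, 0.48], [0.50, 1.05]])
est_pitches = np.array([440.00, 493.88])

# Onset-only matching (offset_ratio=None ignores offsets), the common
# "note-level" criterion: onsets within 50 ms, pitch within a quarter tone.
p, r, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=None)
print(f"note-level (onset)        P={p:.2f} R={r:.2f} F1={f1:.2f}")

# Onset+offset matching additionally requires the offset to agree within
# 20% of the reference note duration (or 50 ms, whichever is larger).
p, r, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches)
print(f"note-level (onset+offset) P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under this criterion, a missed reference note lowers recall while a spurious estimated note lowers precision, so a drop of 20 percentage points in F1 reflects a substantial loss of correctly matched notes rather than a relative 20% change.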