Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis

📅 2025-12-16
🤖 AI Summary
Deep automatic music transcription (AMT) models achieve strong performance on in-distribution datasets, but their generalization across musical genres, dynamic ranges, and polyphonic complexity remains poorly understood. Method: We introduce the Multi-Axis Distribution Shift (MDS) corpus—the first benchmark explicitly designed to evaluate AMT robustness along multiple musically meaningful axes—and propose a dual-track evaluation framework integrating information retrieval principles with music-aware metrics, including harmony-structure-sensitive measures that address limitations of conventional F1 scores. Results: Experiments reveal that dynamic variation degrades dynamics estimation significantly more than onset detection; sound-related shift reduces note-level F1 by 20 percentage points, genre shift by 14; and models exhibit severe robustness degradation under extreme distribution shifts. Our core contribution is the systematic quantification of music-dimensional biases in AMT and the establishment of the first generalization assessment framework grounded in musical structure.

📝 Abstract
Automatic Music Transcription (AMT) -- the task of converting music audio into note representations -- has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and indeed on a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets -- (1) Genre, (2) Random, and (3) MAEtest -- to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
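The note-level F1 figure quoted above can be illustrated with a minimal sketch. The function below assumes the common convention in AMT evaluation (e.g., as implemented in mir_eval): an estimated note counts as a true positive if its pitch matches a reference note and its onset falls within a 50 ms tolerance. This is a simplified illustration, not the paper's exact evaluation pipeline, which also considers offsets and musically informed metrics.

```python
ONSET_TOL = 0.05  # onset matching tolerance in seconds (common AMT convention)

def note_f1(ref_notes, est_notes, onset_tol=ONSET_TOL):
    """Note-level F1 score.

    ref_notes / est_notes: lists of (onset_sec, midi_pitch) tuples.
    Each reference note may be matched by at most one estimated note.
    """
    matched = set()  # indices of reference notes already matched
    tp = 0
    for onset, pitch in est_notes:
        for i, (r_onset, r_pitch) in enumerate(ref_notes):
            if i in matched:
                continue
            if r_pitch == pitch and abs(r_onset - onset) <= onset_tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one note missed entirely, one onset slightly late but
# within the 50 ms tolerance -> 2 TP, 0 FP, 1 FN -> F1 = 0.8.
ref = [(0.00, 60), (0.50, 64), (1.00, 67)]
est = [(0.03, 60), (0.50, 64)]
print(round(note_f1(ref, est), 3))  # 0.8
```

A 20-percentage-point drop in this score under a sound-condition shift means, roughly, that a system matching this toy example's 0.8 in-distribution would fall to 0.6 out of distribution.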
Problem

Research questions and friction points this paper is trying to address.

Analyzes biases in deep music transcription models
Investigates generalization across genres, dynamics, and polyphony
Evaluates performance degradation under musical distribution shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing MDS corpus for distribution shift analysis
Evaluating AMT systems with musically-informed performance metrics
Isolating performance degradation under specific musical variations
Lukáš Samuel Marták
Institute of Computational Perception & LIT AI Lab, Johannes Kepler University, Linz, Austria
Patricia Hu
Institute of Computational Perception & LIT AI Lab, Johannes Kepler University, Linz, Austria
Gerhard Widmer
Professor of Computer Science, Johannes Kepler University Linz
Artificial Intelligence · Machine Learning · Sound and Music Computing · Music Information Retrieval · Computational Musicology