🤖 AI Summary
This work addresses a limitation of traditional automatic piano transcription (APT) methods, which model the task as frame-level multi-label classification and are therefore highly sensitive to note timing misalignments, so the training objective does not reflect how transcription errors are perceived. The study applies optimal transport (OT) theory to the task, proposing a paradigm based on matching distributions of note events over time and frequency. Minimizing the transport cost between the predicted and ground-truth note distributions yields a loss function that tolerates temporal misalignment and leads to perceptually relevant optimization. The model combines a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies in music. Evaluated on the MAESTRO dataset, the method attains state-of-the-art onset detection performance, and the OT loss is also shown to be applicable to existing transcription models.
📝 Abstract
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem rather than as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained state-of-the-art performance in onset detection. We also confirmed the versatility of the OT loss by applying it to existing models.
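
To illustrate why a transport-based loss can accommodate temporal misalignment where a frame-level loss cannot, the following is a minimal sketch and not the paper's implementation. It assumes a single pitch, a one-dimensional time axis with unit frame spacing, and a ground cost equal to the distance in frames; the abstract does not specify the actual cost function or how the time-frequency case is handled. The helper names `wasserstein_1d` and `frame_bce` are hypothetical.

```python
import numpy as np


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two histograms on a unit-spaced grid.

    Assumption: both inputs are non-negative with non-zero total mass.
    Each is normalized to unit mass; in 1D the optimal transport cost
    then equals the summed absolute difference of the two CDFs.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def frame_bce(pred, target, eps=1e-7):
    """Frame-level binary cross-entropy, averaged over frames."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))


n_frames = 100

# Ground truth: one onset at frame 50 (toy data, single pitch).
target = np.zeros(n_frames)
target[50] = 1.0

# Two predictions: one late by a single frame, one late by 30 frames.
near = np.zeros(n_frames)
near[51] = 1.0
far = np.zeros(n_frames)
far[80] = 1.0

ot_near = wasserstein_1d(near, target)   # 1.0: mass moves one frame
ot_far = wasserstein_1d(far, target)     # 30.0: mass moves thirty frames
bce_near = frame_bce(near, target)
bce_far = frame_bce(far, target)

print("OT  loss (1 frame off, 30 frames off):", ot_near, ot_far)
print("BCE loss (1 frame off, 30 frames off):", bce_near, bce_far)
```

In this toy setting the frame-level cross-entropy assigns essentially the same penalty to both predictions, because each is simply wrong at two frames, while the transport cost grows with the size of the timing error. A near miss is therefore penalized much less than a distant one.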