🤖 AI Summary
This study reframes automatic drum transcription (ADT) as an audio-conditioned, end-to-end generative task, departing from the conventional discriminative approach of predicting drum events from spectrograms. Because diffusion models struggle to jointly model binary onset events and continuous velocity values, we propose an Annealed Pseudo-Huber loss that enables the two to be optimized together. We also incorporate features from a music foundation model (MFM) as conditioning inputs, which improves robustness to out-of-domain drum audio. The resulting framework, Noise-to-Notes (N2N), supports a flexible speed-accuracy trade-off and shows strong inpainting capability. Evaluated on multiple ADT benchmarks, including RWC, ENST, and Sinsy, the method achieves state-of-the-art (SOTA) results. These findings support the generative paradigm as an effective way to model symbolic rhythmic representations.
📝 Abstract
Automatic drum transcription (ADT) is traditionally formulated as a discriminative task that predicts drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework that uses diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, generating binary onset values alongside continuous velocity values is challenging for diffusion models; to overcome this, we introduce an Annealed Pseudo-Huber loss that facilitates effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and that N2N establishes new state-of-the-art performance across multiple ADT benchmarks.
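To make the loss mentioned above more concrete, the sketch below shows the standard Pseudo-Huber loss, sqrt(d^2 + c^2) - c, together with one possible way to anneal its parameter c during training. The abstract does not give the schedule, the direction of annealing, or how the loss is weighted between onsets and velocities, so `annealed_c`, its linear form, and the values of `c_start` and `c_end` are assumptions made purely for illustration, not the authors' method. The only property relied on is the standard one: for large c the loss behaves like a scaled squared error, and for small c it approaches an absolute error.

```python
import math


def pseudo_huber(pred, target, c):
    """Mean Pseudo-Huber loss over paired values: sqrt(d^2 + c^2) - c."""
    if c < 0:
        raise ValueError("c must be non-negative")
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    total = 0.0
    for p, t in zip(pred, target):
        d = p - t
        total += math.sqrt(d * d + c * c) - c
    return total / len(pred)


def annealed_c(step, total_steps, c_start=1.0, c_end=0.01):
    """Hypothetical schedule: linearly move c from c_start to c_end.

    The paper's actual annealing schedule is not stated in the abstract;
    this linear form and its default values are assumptions.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return c_start + frac * (c_end - c_start)


# Toy usage: one vector mixing binary onsets and continuous velocities.
target = [1.0, 0.0, 0.0, 1.0, 0.8, 0.0, 0.0, 0.6]   # onsets, then velocities
pred = [0.9, 0.1, 0.0, 0.7, 0.7, 0.0, 0.1, 0.5]
for step in (0, 500, 1000):
    c = annealed_c(step, total_steps=1000)
    print(step, round(c, 3), pseudo_huber(pred, target, c))
```

A single loss applied to both value types is the simplest reading of "joint optimization"; a real implementation might weight onset and velocity terms differently.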