Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

📅 2025-09-25

🤖 AI Summary

This paper reframes automatic drum transcription (ADT) as an audio-conditioned generative task rather than the conventional discriminative prediction of drum events from spectrograms. Its framework, Noise-to-Notes (N2N), uses a diffusion model to turn Gaussian noise into drum events with associated velocities. Because binary onsets and continuous velocities are hard to generate jointly with a diffusion model, the authors introduce an Annealed Pseudo-Huber loss for joint optimization. They also condition the model on features from music foundation models (MFMs) alongside spectrogram features, which improves robustness to out-of-domain drum audio. The generative formulation allows a flexible speed-accuracy trade-off and gives strong inpainting capabilities. According to the abstract, N2N sets a new state of the art on multiple ADT benchmarks.

📝 Abstract

Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.
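The sampling procedure the abstract describes can be sketched roughly as follows. This is a toy, deterministic DDIM-style loop in plain Python, not the authors' implementation: the `denoise` callable, the linear noise schedule, and the `known` mapping used for inpainting are all assumptions made for illustration. The `num_steps` argument is the knob behind the speed-accuracy trade-off, since fewer steps mean fewer model calls.

```python
import math
import random


def sample(denoise, length, num_steps, known=None, seed=0):
    """Toy DDIM-style sampler over a 1-D sequence of drum activations.

    denoise(x_t, alpha_bar) -> estimate of the clean sequence x0.
        Hypothetical model call; a real model would also take audio features.
    known: optional {index: value} of events to hold fixed (inpainting).
    """
    rng = random.Random(seed)
    known = known or {}

    # Start from pure Gaussian noise (alpha_bar = 0).
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]

    for k in range(num_steps):
        # Assumed linear schedule: alpha_bar rises from 0 (noise) to 1 (clean).
        ab_t = k / num_steps
        ab_next = (k + 1) / num_steps

        # Predict the clean signal, then recover the implied noise.
        x0_hat = denoise(x, ab_t)
        eps_hat = [
            (xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
            for xi, x0i in zip(x, x0_hat)
        ]

        # Deterministic move to the next, less noisy level.
        x = [
            math.sqrt(ab_next) * x0i + math.sqrt(1.0 - ab_next) * ei
            for x0i, ei in zip(x0_hat, eps_hat)
        ]

        # Inpainting: overwrite known positions with a version of the known
        # value noised to the current level; at the last step this is exact.
        for i, value in known.items():
            x[i] = (
                math.sqrt(ab_next) * value
                + math.sqrt(1.0 - ab_next) * rng.gauss(0.0, 1.0)
            )
    return x
```

With a stand-in denoiser such as `lambda x, ab: [0.5] * len(x)`, calling `sample(..., length=4, num_steps=3, known={1: 1.0})` returns the denoiser's estimate everywhere except index 1, which keeps the supplied value.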

Problem

Research questions and friction points this paper is trying to address.

Redefining drum transcription as a generative task

Jointly generating binary onsets and continuous velocities with a diffusion model

Enhancing robustness with music foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion modeling transforms noise into drum events

Annealed Pseudo-Huber loss enables joint optimization

Music foundation models enhance spectrogram features
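To make the loss idea concrete, here is a minimal sketch of a Pseudo-Huber penalty whose transition parameter `c` is annealed over training. The page does not give the paper's exact formula or schedule, so the form `sqrt(r^2 + c^2) - c`, the geometric schedule, and the `c_start` and `c_end` values are assumptions chosen only for illustration.

```python
import math


def pseudo_huber(residual: float, c: float) -> float:
    """Pseudo-Huber penalty: roughly r^2 / (2c) for |r| << c, |r| - c for |r| >> c."""
    return math.sqrt(residual * residual + c * c) - c


def annealed_c(step: int, total_steps: int,
               c_start: float = 1.0, c_end: float = 0.01) -> float:
    """Hypothetical geometric anneal of c from c_start to c_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return c_start * (c_end / c_start) ** frac


def annealed_pseudo_huber_loss(pred, target, step, total_steps):
    """Mean Pseudo-Huber penalty over a sequence, using the annealed c."""
    c = annealed_c(step, total_steps)
    return sum(pseudo_huber(p - t, c) for p, t in zip(pred, target)) / len(pred)
```

Under this assumed schedule the penalty behaves like a squared error early in training (large `c`) and moves toward an absolute error later (small `c`). Whether the paper anneals in this direction is not stated here.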
