Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study reframes automatic drum transcription (ADT) as an audio-conditioned generative task, departing from the conventional discriminative approach of predicting drum events from spectrograms. Diffusion models struggle to jointly generate binary onset events and continuous velocity values, so the authors introduce an Annealed Pseudo-Huber loss that enables joint optimization of both. They also condition the model on features from music foundation models (MFMs), which improves robustness to out-of-domain drum audio. The resulting framework, Noise-to-Notes (N2N), supports a flexible speed-accuracy trade-off and has strong inpainting capabilities. N2N achieves state-of-the-art results across multiple ADT benchmarks.

📝 Abstract
Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.
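The abstract names an Annealed Pseudo-Huber loss but does not spell out its form. As a rough illustration only, the sketch below uses the standard Pseudo-Huber penalty, which behaves like a squared error for small residuals and like an absolute error for large ones, and varies its scale parameter `c` over training. The function names, the geometric schedule, its direction, and the values `c_start` and `c_end` are all assumptions made for this example, not details taken from the paper.

```python
import math


def pseudo_huber(residual: float, c: float) -> float:
    """Standard Pseudo-Huber penalty.

    Approximately residual**2 / 2 when |residual| << c,
    and approximately c * |residual| when |residual| >> c.
    """
    return c * c * (math.sqrt(1.0 + (residual / c) ** 2) - 1.0)


def annealed_c(step: int, total_steps: int,
               c_start: float = 1.0, c_end: float = 0.01) -> float:
    """Hypothetical annealing schedule (assumption, not from the paper):
    geometric interpolation of c from c_start to c_end over training."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return c_start * (c_end / c_start) ** frac


def annealed_pseudo_huber_loss(pred, target, step, total_steps):
    """Mean Pseudo-Huber penalty over paired predictions and targets,
    using the scale c chosen by the (assumed) schedule at this step."""
    c = annealed_c(step, total_steps)
    return sum(pseudo_huber(p - t, c) for p, t in zip(pred, target)) / len(pred)
```

With this schedule the penalty starts close to a squared error and moves toward an absolute error as `c` shrinks; whether N2N anneals in this direction is not stated in the text above.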
Problem

Research questions and friction points this paper is trying to address.

Redefining drum transcription as a generative task
Jointly generating binary onsets and continuous velocities
Enhancing robustness with music foundation models
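To make the binary-versus-continuous issue concrete, here is a minimal sketch of one way drum events could be laid out as two frame-by-class grids: an onset grid holding 0 or 1, and a velocity grid holding values in [0, 1]. The event tuple layout, the 100 frames-per-second rate, and the function name are hypothetical choices for this example; the paper's actual target representation may differ. The only fixed fact used is that MIDI velocities range up to 127.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def encode_events(events: Sequence[Tuple[float, int, int]],
                  num_frames: int,
                  num_classes: int,
                  frame_rate: float = 100.0) -> Tuple[Grid, Grid]:
    """Encode (time_seconds, drum_class, midi_velocity) events as two grids.

    onset[frame][cls]    is 1.0 where a hit occurs, else 0.0 (binary).
    velocity[frame][cls] is midi_velocity / 127 at hits, else 0.0 (continuous).
    The frame rate is an assumed value for illustration.
    """
    onset = [[0.0] * num_classes for _ in range(num_frames)]
    velocity = [[0.0] * num_classes for _ in range(num_frames)]
    for time_s, cls, midi_vel in events:
        frame = int(round(time_s * frame_rate))
        # Skip events that fall outside the grid instead of raising.
        if 0 <= frame < num_frames and 0 <= cls < num_classes:
            onset[frame][cls] = 1.0
            velocity[frame][cls] = midi_vel / 127.0
    return onset, velocity


# Example: class 0 hit at 0.0 s (full velocity), class 1 hit at 0.5 s (softer).
onset_grid, velocity_grid = encode_events(
    [(0.0, 0, 127), (0.5, 1, 64)], num_frames=100, num_classes=3
)
```

A generative model that emits both grids has to produce sharply bimodal values in one and smoothly varying values in the other, which is the tension the list above refers to.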
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion modeling transforms noise into drum events
Annealed Pseudo-Huber loss enables joint optimization
Music foundation models enhance spectrogram features
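The first point can be illustrated with the forward (noising) half of a generic diffusion process, plus a mask-and-replace step that is one common way diffusion models perform inpainting. This is a sketch under stated assumptions: the linear beta schedule, its default values, and all function names are illustrative and are not taken from N2N, and the learned, audio-conditioned denoiser that would produce `generated` is omitted entirely.

```python
import math
import random


def alpha_bar(t: int, num_steps: int,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_i) for i = 0..t under a linear
    beta schedule (the schedule and its endpoints are assumptions)."""
    prod = 1.0
    for i in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t: int, num_steps: int, rng: random.Random):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn from a standard Gaussian."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]


def inpaint_blend(generated, known_noised, keep_mask):
    """Mask-and-replace: take the noised known value where keep_mask is
    truthy, and the model's generated value everywhere else."""
    return [k if m else g
            for g, k, m in zip(generated, known_noised, keep_mask)]


# Example: noise a short onset sequence, then keep positions 0 and 2 fixed.
rng = random.Random(0)
noised = add_noise([1.0, 0.0, 1.0], t=10, num_steps=1000, rng=rng)
blended = inpaint_blend([0.1, 0.2, 0.3], noised, keep_mask=[1, 0, 1])
```

In a full sampler the blend would be applied at every reverse step so that the known drum events constrain the regions being filled in; how N2N implements inpainting is not described in the text above.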
Authors

Michael Yeung, Sony Group Corporation, Tokyo, Japan
Keisuke Toyama, Sony Group Corporation (Audio Signal Processing, Music Information Retrieval, Natural Language Processing)
Toya Teramoto, Sony Group Corporation, Tokyo, Japan
Shusuke Takahashi, Sony Group Corporation (audio signal processing)
Tamaki Kojima, Sony Group Corporation, Tokyo, Japan