🤖 AI Summary
This study addresses the challenge of generating high-fidelity, expressive drum audio from symbolic drum notation that includes fine-grained timing and velocity information. The work proposes an end-to-end approach that, for the first time, integrates discrete token prediction from neural audio codecs (such as EnCodec, DAC, and X-Codec) into the task of drum grid-to-audio synthesis. Specifically, a Transformer model maps expressive MIDI representations to sequences of codec tokens, which are then decoded into waveforms using a pretrained neural audio decoder. Experiments on the E-GMD dataset demonstrate that the proposed method significantly outperforms baseline systems in both audio fidelity and musical alignment, thereby validating discrete codec token prediction as a viable and effective paradigm for expressive drum audio synthesis.
📝 Abstract
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset (E-GMD), a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
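To make the notion of an expressive drum grid concrete, the following is a minimal sketch of how note events might be quantized onto a fixed step grid while keeping velocity and microtiming. The abstract does not specify the exact layout, so everything here is an assumption for illustration: the function name `events_to_grid`, the three-channel layout (hit, velocity, offset), the normalization of velocity by 127, and the choice to keep the loudest hit when two events land on the same cell.

```python
import math
from typing import List, Tuple

# Hypothetical event format: (onset in seconds, instrument index, MIDI velocity 1-127).
Event = Tuple[float, int, int]
Grid = List[List[float]]


def events_to_grid(
    events: List[Event],
    num_steps: int,
    num_instruments: int,
    step_seconds: float,
) -> Tuple[Grid, Grid, Grid]:
    """Quantize note events onto a step grid, keeping velocity and microtiming.

    Returns three num_steps x num_instruments grids:
      hits     1.0 where a note was assigned to the cell, else 0.0
      velocity MIDI velocity scaled to [0, 1]
      offset   signed deviation from the grid line, in fractions of a step,
               in the range [-0.5, 0.5)
    """
    hits = [[0.0] * num_instruments for _ in range(num_steps)]
    velocity = [[0.0] * num_instruments for _ in range(num_steps)]
    offset = [[0.0] * num_instruments for _ in range(num_steps)]

    for onset, instrument, midi_velocity in events:
        position = onset / step_seconds          # onset measured in steps
        step = math.floor(position + 0.5)        # nearest grid line
        if not (0 <= step < num_steps and 0 <= instrument < num_instruments):
            continue                             # ignore events outside the grid
        scaled = midi_velocity / 127.0
        # Assumption: if several events share a cell, keep the loudest one.
        if scaled >= velocity[step][instrument]:
            hits[step][instrument] = 1.0
            velocity[step][instrument] = scaled
            offset[step][instrument] = position - step

    return hits, velocity, offset


if __name__ == "__main__":
    # Two hits on a sixteenth-note grid at 120 BPM (one step = 0.125 s):
    # a kick exactly on the first step and a snare slightly late on the second.
    example = [(0.0, 0, 127), (0.13, 1, 64)]
    h, v, o = events_to_grid(example, num_steps=4, num_instruments=2, step_seconds=0.125)
    print(h[1][1], round(v[1][1], 3), round(o[1][1], 3))
```

In a pipeline like the one described, a grid of this kind would be the Transformer input, and the output would be a sequence of codec token indices that a pretrained decoder turns into a waveform. Those later stages depend on the specific codec and are not sketched here.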