🤖 AI Summary
To address the challenge of rapidly converting gestural rhythms (e.g., finger drumming, beatboxing) into high-fidelity drum recordings during music creation, this paper introduces TRIA (The Rhythm In Anything), a drum sound generation model offering zero-shot operation, multi-timbre control, and natural rhythm elaboration. Built on a masked Transformer architecture, TRIA jointly conditions on a rhythmic audio prompt and a second prompt specifying drumkit timbre, training efficiently on limited public data (under 10 hours). Without fine-tuning, it generates expressive, high-fidelity drum audio that faithfully realizes complex rhythmic patterns across a wide range of timbres. In both objective metrics and subjective listening tests, TRIA outperforms existing baselines, and its zero-shot, multi-timbre flexibility enables fast, intuitive musical ideation that streamlines creative workflows.
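The paper does not publish implementation details here, but the masked-transformer generation it describes is commonly realized with MaskGIT-style iterative decoding: start from a fully masked token sequence, predict all masked slots conditioned on the prompts, commit only the most confident predictions, and repeat. The sketch below illustrates that decoding loop only; `toy_predict`, `VOCAB`, `MASK`, and the linear unmasking schedule are placeholders of our own, not TRIA's actual components.

```python
import random

MASK = -1   # sentinel for masked token positions (assumption)
VOCAB = 16  # toy audio-codec vocabulary size (assumption)

def toy_predict(seq, rhythm_prompt, timbre_prompt):
    """Stand-in for the masked transformer: returns a (token, confidence)
    pair for every position. A real model would attend over the rhythm
    and timbre prompts; here we just seed a RNG with them."""
    rng = random.Random(sum(rhythm_prompt) + sum(timbre_prompt) + len(seq))
    return [(rng.randrange(VOCAB), rng.random()) for _ in seq]

def iterative_unmask(rhythm_prompt, timbre_prompt, length=8, steps=4):
    """MaskGIT-style decoding: begin fully masked, and at each step fill
    in the highest-confidence predictions, leaving fewer positions
    masked (simple linear schedule) until none remain."""
    seq = [MASK] * length
    for step in range(steps, 0, -1):
        preds = toy_predict(seq, rhythm_prompt, timbre_prompt)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        keep_masked = (len(masked) * (step - 1)) // step
        # commit the most confident masked positions this step
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            seq[i] = preds[i][0]
    return seq

tokens = iterative_unmask(rhythm_prompt=[1, 2, 3], timbre_prompt=[4, 5])
```

In a full system, `tokens` would be passed to a neural audio codec's decoder to produce the drum waveform; this sketch stops at the token level.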
📝 Abstract
Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.