🤖 AI Summary
To address the challenge of rapidly converting gestural rhythms (e.g., finger drumming, beatboxing) into high-fidelity drum recordings during music creation, this paper introduces TRIA (The Rhythm In Anything), a drum sound generation model offering zero-shot operation, multi-timbre control, and natural rhythm elaboration. Built on a masked Transformer architecture, TRIA jointly conditions on a rhythmic audio prompt and a second prompt specifying drumkit timbre, training efficiently on limited public data (under 10 hours). Without fine-tuning, it generates expressive, high-fidelity drum audio that faithfully realizes complex rhythmic patterns across a wide range of timbres. In both objective metrics and subjective listening tests, TRIA outperforms existing baselines, and its zero-shot, multi-timbre flexibility enables fast, intuitive musical ideation that streamlines creative workflows.
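The paper does not publish implementation details here, but the masked-transformer generation it describes is commonly realized with MaskGIT-style iterative decoding: start from a fully masked token sequence, predict all masked slots conditioned on the prompts, commit only the most confident predictions, and repeat. The sketch below illustrates that decoding loop only; `toy_predict`, `VOCAB`, `MASK`, and the linear unmasking schedule are placeholders of our own, not TRIA's actual components.

```python
import random

MASK = -1   # sentinel for masked token positions (assumption)
VOCAB = 16  # toy audio-codec vocabulary size (assumption)

def toy_predict(seq, rhythm_prompt, timbre_prompt):
    """Stand-in for the masked transformer: returns a (token, confidence)
    pair for every position. A real model would attend over the rhythm
    and timbre prompts; here we just seed a RNG with them."""
    rng = random.Random(sum(rhythm_prompt) + sum(timbre_prompt) + len(seq))
    return [(rng.randrange(VOCAB), rng.random()) for _ in seq]

def iterative_unmask(rhythm_prompt, timbre_prompt, length=8, steps=4):
    """MaskGIT-style decoding: begin fully masked, and at each step fill
    in the highest-confidence predictions, leaving fewer positions
    masked (simple linear schedule) until none remain."""
    seq = [MASK] * length
    for step in range(steps, 0, -1):
        preds = toy_predict(seq, rhythm_prompt, timbre_prompt)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        keep_masked = (len(masked) * (step - 1)) // step
        # commit the most confident masked positions this step
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            seq[i] = preds[i][0]
    return seq

tokens = iterative_unmask(rhythm_prompt=[1, 2, 3], timbre_prompt=[4, 5])
```

In a full system, `tokens` would be passed to a neural audio codec's decoder to produce the drum waveform; this sketch stops at the token level.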
📝 Abstract
Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.