🤖 AI Summary
This work addresses two key challenges in multi-instrument audio performance generation: (1) the difficulty of jointly modeling textual semantics and musical score structure, and (2) the lack of fine-grained expressive control. To this end, we propose RenderBox—a novel controllable diffusion framework. Methodologically, it introduces a dual-path (text + score) diffusion generation paradigm, employs a cross-instrument unified Diffusion Transformer architecture, incorporates cross-attention for multimodal conditional modeling, and adopts a progressive curriculum learning strategy for training optimization. Our contributions include the first framework enabling natural-language descriptions and standard musical scores to jointly drive high-fidelity performance synthesis, with precise control over expressive dimensions—including tempo, stylistic nuance, and pitch errors. Experiments demonstrate significant improvements over baselines in FAD, CLAP score, and beat/pitch accuracy. Subjective evaluation confirms high audio naturalness, strong musical expressivity, and excellent prompt alignment.
📝 Abstract
Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level control through natural language descriptions and fine-grained control through music scores. Based on a diffusion transformer architecture with cross-attention joint conditioning, we propose a curriculum-based training paradigm that progresses from plain synthesis to expressive performance, gradually incorporating controllable factors such as speed, mistakes, and style diversity. RenderBox outperforms baseline models on key metrics such as FAD and CLAP score, as well as tempo and pitch accuracy under different prompting tasks. Subjective evaluation further demonstrates that RenderBox generates controllable expressive performances that sound natural and musically engaging, aligning well with prompts and intent.
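To make the joint-conditioning idea concrete, here is a minimal NumPy sketch of cross-attention where text-prompt embeddings and score embeddings are concatenated into one conditioning sequence that the audio latents attend over. All names, dimensions, and the single-head formulation are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, cond, Wq, Wk, Wv):
    """Single-head cross-attention.
    x:    (T, d) noisy audio latent tokens (queries)
    cond: (S, d) conditioning tokens (keys/values)
    """
    Q, K, V = x @ Wq, cond @ Wk, cond @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 16
text_emb = rng.normal(size=(4, d))    # hypothetical text-prompt tokens
score_emb = rng.normal(size=(8, d))   # hypothetical score tokens
cond = np.concatenate([text_emb, score_emb], axis=0)  # joint condition sequence
x = rng.normal(size=(10, d))          # audio latents at one diffusion step
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(x, cond, Wq, Wk, Wv)
print(out.shape)  # one conditioned output vector per audio latent token
```

Because the text and score tokens live in one key/value sequence, each denoising step can freely weight coarse stylistic cues against note-level score structure; a multi-head, learned version of this block would sit inside each diffusion transformer layer.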