🤖 AI Summary
This work addresses two key challenges in multi-instrument audio performance generation: (1) the difficulty of jointly modeling textual semantics and musical score structure, and (2) the lack of fine-grained expressive control. To this end, we propose RenderBox—a novel controllable diffusion framework. Methodologically, it introduces a dual-path (text + score) diffusion generation paradigm, employs a cross-instrument unified Diffusion Transformer architecture, incorporates cross-attention for multimodal conditional modeling, and adopts a progressive curriculum learning strategy for training optimization. Our contributions include the first framework enabling natural-language descriptions and standard musical scores to jointly drive high-fidelity performance synthesis, with precise control over expressive dimensions—including tempo, stylistic nuance, and pitch errors. Experiments demonstrate significant improvements over baselines in FAD, CLAP score, and beat/pitch accuracy. Subjective evaluation confirms high audio naturalness, strong musical expressivity, and excellent prompt alignment.
📝 Abstract
Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level control through natural language descriptions and fine-grained control through music scores. Based on a diffusion transformer architecture with cross-attention joint conditioning, we propose a curriculum-based training paradigm that progresses from plain synthesis to expressive performance, gradually incorporating controllable factors such as speed, mistakes, and style diversity. RenderBox outperforms baseline models on key metrics such as FAD and CLAP score, as well as tempo and pitch accuracy under different prompting tasks. Subjective evaluation further demonstrates that RenderBox generates controllable expressive performances that sound natural and musically engaging, aligning well with prompts and intent.
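To make the joint-conditioning idea concrete, here is a minimal NumPy sketch of cross-attention where text-prompt embeddings and score embeddings are concatenated into one conditioning sequence that the audio latents attend over. All names, dimensions, and the single-head formulation are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, cond, Wq, Wk, Wv):
    """Single-head cross-attention.
    x:    (T, d) noisy audio latent tokens (queries)
    cond: (S, d) conditioning tokens (keys/values)
    """
    Q, K, V = x @ Wq, cond @ Wk, cond @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 16
text_emb = rng.normal(size=(4, d))    # hypothetical text-prompt tokens
score_emb = rng.normal(size=(8, d))   # hypothetical score tokens
cond = np.concatenate([text_emb, score_emb], axis=0)  # joint condition sequence
x = rng.normal(size=(10, d))          # audio latents at one diffusion step
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(x, cond, Wq, Wk, Wv)
print(out.shape)  # one conditioned output vector per audio latent token
```

Because the text and score tokens live in one key/value sequence, each denoising step can freely weight coarse stylistic cues against note-level score structure; a multi-head, learned version of this block would sit inside each diffusion transformer layer.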