Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

📅 2026-05-21

🤖 AI Summary
Existing audio diffusion models are bidirectional and non-streaming, which makes them a poor fit for real-time interactive music generation, and their block-wise outpainting pipelines are less efficient at inference than discrete autoregressive models. This work proposes Live Music Diffusion Models (LMDMs), a streaming audio diffusion approach that modifies the generative diffusion process and uses block-wise KV caching to improve inference efficiency. It also introduces ARC-Forcing, a post-training alignment paradigm that reduces error accumulation without explicit reinforcement learning or reward models. The resulting models run interactively on consumer hardware and are demonstrated on text-conditioned generation, sketch-based music synthesis, and jamming. They have also been used in a real artist-AI collaboration as a "generative delay" that transforms musicians' improvisation live while running locally on a consumer gaming laptop.
📝 Abstract
Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
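To make the block-wise KV caching idea concrete, here is a toy sketch in pure Python. It is not the LMDM implementation: the `project`, `attend`, and `generate` functions, the scaling constants, and the "denoising" update are all hypothetical stand-ins chosen for illustration. It also assumes that the keys and values of already-finished blocks do not change while the current block is being denoised, so they can be computed once and reused at every denoising step.

```python
import math
import random

K_SCALE, V_SCALE = 0.5, 0.8  # assumed toy "weights" for the key/value projections


def project(frame, scale):
    """Toy projection: elementwise scaling, standing in for a learned linear layer."""
    return [scale * x for x in frame]


def attend(query, keys, values):
    """Single-head softmax attention of one query frame over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def generate(noise_blocks, n_steps, use_cache):
    """Generate blocks one at a time; each block attends to all finished blocks.

    Returns the generated frames and the number of context frames whose
    key/value projections had to be computed.
    """
    history, cache_k, cache_v = [], [], []
    context_projections = 0
    for block in noise_blocks:
        for _ in range(n_steps):
            if use_cache:
                ctx_k, ctx_v = cache_k, cache_v  # reuse stored context K/V
            else:
                # Naive pipeline: re-project the whole history at every step.
                ctx_k = [project(f, K_SCALE) for f in history]
                ctx_v = [project(f, V_SCALE) for f in history]
                context_projections += len(history)
            keys = ctx_k + [project(f, K_SCALE) for f in block]
            values = ctx_v + [project(f, V_SCALE) for f in block]
            # Hypothetical "denoising" update: blend each frame with its attention output.
            block = [
                [0.5 * x + 0.5 * a for x, a in zip(f, attend(f, keys, values))]
                for f in block
            ]
        history.extend(block)
        if use_cache:
            # Project the finished block once and append it to the cache.
            cache_k.extend(project(f, K_SCALE) for f in block)
            cache_v.extend(project(f, V_SCALE) for f in block)
            context_projections += len(block)
    return history, context_projections


rng = random.Random(0)
# 3 blocks, 2 frames per block, 4 features per frame.
noise = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)] for _ in range(3)]
cached_out, cached_cost = generate(noise, n_steps=4, use_cache=True)
naive_out, naive_cost = generate(noise, n_steps=4, use_cache=False)
print(cached_out == naive_out, cached_cost, naive_cost)  # prints: True 6 24
```

In this toy setting both paths produce identical frames, but the cached path projects each finished frame once, while the naive path re-projects the full history at every denoising step of every block. That gap grows with both the history length and the number of steps.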
Problem

Research questions and friction points this paper is trying to address.

interactive music generation
diffusion models
streaming audio
real-time inference
consumer hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live Music Diffusion Models
KV Caching
ARC-Forcing
interactive music generation
real-time diffusion