Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of combining high-quality accompaniment generation with low latency in real-time human-AI musical co-performance. To this end, we propose a real-time accompaniment generation system based on a latent diffusion model, formulated as a sliding-window look-ahead protocol in which the model predicts future audio from partial context. Consistency distillation reduces sampling time by 5.4×, and both the original and the distilled model run in real time. The system couples a MAX/MSP front-end (real-time audio input, buffering, and playback) with a Python inference server running the generative model, and the two communicate via OSC messages over UDP. Evaluated on musical coherence, beat alignment, and audio quality, both models perform strongly in the Retrospective regime and degrade gracefully as the look-ahead increases.
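To make the OSC/UDP link concrete, here is a minimal sketch of how a Python process could encode an OSC 1.0 message and send it as a single UDP datagram, using only the standard library. The address `/chunk/ready` and its arguments are hypothetical; the paper's actual message schema is not given here, and a real deployment would more likely use an existing OSC library on the Python side and Max's `udpsend`/`udpreceive` objects on the other.

```python
import socket
import struct


def _pad(raw: bytes) -> bytes:
    """Null-terminate and pad to a multiple of 4 bytes (OSC 1.0 string rule)."""
    raw += b"\x00"
    return raw + b"\x00" * (-len(raw) % 4)


def encode_osc(address: str, *args) -> bytes:
    """Encode an OSC message with int32, float32, and string arguments."""
    tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, bool):
            raise TypeError("bool arguments are not handled in this sketch")
        if isinstance(arg, int):
            tags += "i"
            payload += struct.pack(">i", arg)   # big-endian int32
        elif isinstance(arg, float):
            tags += "f"
            payload += struct.pack(">f", arg)   # big-endian float32
        elif isinstance(arg, str):
            tags += "s"
            payload += _pad(arg.encode("ascii"))
        else:
            raise TypeError(f"unsupported OSC argument: {arg!r}")
    return _pad(address.encode("ascii")) + _pad(tags.encode("ascii")) + payload


def send_osc(host: str, port: int, address: str, *args) -> None:
    """Send one OSC message as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_osc(address, *args), (host, port))


# Hypothetical usage: tell the front-end that chunk 7 is ready, with gain 0.5.
# send_osc("127.0.0.1", 9000, "/chunk/ready", 7, 0.5)
```

UDP is connectionless and does not retransmit, which keeps per-message overhead low but means a lost datagram is simply lost; a real-time system built this way has to tolerate that.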

📝 Abstract

We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
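The trade-off between latency and look-ahead depth can be illustrated with a small scheduling sketch. This is one plausible reading of a sliding-window look-ahead protocol, not the paper's exact formulation: the names `plan_step` and `meets_deadline`, the hop, context, and look-ahead parameters, and all numeric values are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    context: Tuple[int, int]  # (start_ms, end_ms) of input audio the model sees
    target: Tuple[int, int]   # (start_ms, end_ms) of accompaniment to generate


def plan_step(k: int, hop_ms: int, context_ms: int, lookahead_ms: int) -> Step:
    """Plan the k-th generation step of a sliding-window schedule (assumed layout)."""
    now = k * hop_ms                      # input audio has been received up to here
    ctx_start = max(0, now - context_ms)  # the window slides once the buffer is full
    tgt_start = now + lookahead_ms        # the target lies look-ahead ms in the future
    return Step((ctx_start, now), (tgt_start, tgt_start + hop_ms))


def meets_deadline(latency_ms: int, hop_ms: int, lookahead_ms: int) -> bool:
    """Check two real-time conditions for sequential (non-pipelined) inference."""
    arrives_in_time = latency_ms <= lookahead_ms  # output ready before its start time
    keeps_up = latency_ms <= hop_ms               # one hop generated per hop elapsed
    return arrives_in_time and keeps_up


# Hypothetical numbers: 500 ms hop, 2 s context, 250 ms look-ahead.
step = plan_step(3, hop_ms=500, context_ms=2000, lookahead_ms=250)
print(step.context, step.target)
print(meets_deadline(200, hop_ms=500, lookahead_ms=250))  # faster sampler
print(meets_deadline(300, hop_ms=500, lookahead_ms=250))  # slower sampler
```

Under these assumptions, a slower sampler needs a deeper look-ahead to meet its deadline, so the model must predict further beyond the available context. A faster sampler, such as a distilled one, can operate with a shallower look-ahead.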

Problem

Research questions and friction points this paper is trying to address.

real-time

musical co-performance

accompaniment generation

latency

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent diffusion models

real-time music generation

consistency distillation