🤖 AI Summary
This work addresses the challenge of achieving both high-quality accompaniment generation and low latency in real-time human-AI musical co-performance. It proposes a real-time accompaniment generation system based on a latent diffusion model, formulated as a sliding-window look-ahead prediction task, and, for the first time in music accompaniment, applies consistency distillation to accelerate sampling, yielding a 5.4× reduction in sampling time; both the original and the distilled model run in real time. The system connects a large Python-based generative model to MAX/MSP through a low-latency communication pipeline built on OSC/UDP messages. Experiments evaluate beat alignment, musical coherence, and audio quality: both models perform strongly in the Retrospective regime and degrade gracefully as the look-ahead increases.
📝 Abstract
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front end, which handles real-time audio input, buffering, and playback, with a Python inference server running the generative model; the two communicate via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, a setting in which system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4× reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
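The trade-off between model latency and look-ahead depth can be illustrated with a small timing sketch. It is not the authors' implementation: the window layout, the `hop` parameter, the feasibility rule `latency + hop <= lookahead`, and all numeric values are assumptions chosen for illustration; only the 5.4× ratio comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPlan:
    """Time spans in seconds (hypothetical layout, not taken from the paper)."""
    context_start: float
    context_end: float
    target_start: float
    target_end: float


def plan_window(now: float, window: float, lookahead: float) -> WindowPlan:
    """Context is the last `window` seconds heard; the target is a window of
    the same length shifted `lookahead` seconds into the future (assumption)."""
    return WindowPlan(
        context_start=now - window,
        context_end=now,
        target_start=now - window + lookahead,
        target_end=now + lookahead,
    )


def unseen_fraction(window: float, lookahead: float) -> float:
    """Share of the target window for which no context audio has been heard."""
    return min(max(lookahead, 0.0), window) / window


def min_lookahead(latency: float, hop: float) -> float:
    """Smallest look-ahead that still leaves `hop` seconds of freshly generated
    audio unplayed once inference has finished (assumed feasibility rule)."""
    return latency + hop


def is_real_time(latency: float, lookahead: float, hop: float) -> bool:
    return lookahead >= min_lookahead(latency, hop)


if __name__ == "__main__":
    # Illustrative numbers only: a 6 s window, 1 s hop, and a made-up
    # 2.7 s sampling time for the undistilled model.
    base_latency = 2.7
    distilled_latency = base_latency / 5.4  # 5.4x reduction from the abstract
    print(plan_window(now=10.0, window=6.0, lookahead=2.0))
    print(unseen_fraction(window=6.0, lookahead=2.0))
    print(min_lookahead(base_latency, hop=1.0))
    print(min_lookahead(distilled_latency, hop=1.0))
```

Under these assumptions, a shorter sampling time lowers the minimum look-ahead, which in turn reduces the share of the target window that must be predicted without any context. If, as the name suggests, the Retrospective regime corresponds to a look-ahead of zero, the unseen share is zero there.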