🤖 AI Summary
This work addresses real-time audio-to-accompaniment generation, i.e., low-latency, high-fidelity synthesis of a coherent instrumental accompaniment (e.g., guitar) synchronized with a singer's streaming vocal input. Methodologically, we propose the first systematic streaming modeling framework, introducing a "future visibility–output chunk duration" trade-off to explicitly quantify the intrinsic tension among latency, coherence, and throughput. We train streaming Transformer decoders across a grid of these design variables and find that naive maximum-likelihood training is insufficient for coherent accompaniment when future context is unavailable, motivating anticipatory prediction objectives for real-time settings. Experiments demonstrate the feasibility of generating high-quality accompaniment under realistic system latency constraints. Our results reveal the critical role of proactive temporal modeling for real-time music generation and point toward a new paradigm for interactive AI music systems.
📝 Abstract
Music generation models can produce high-fidelity, coherent accompaniment given complete audio input, but are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate, in real time, a coherent accompanying stream (e.g., a guitar accompaniment). In this work, we propose a model design that accounts for the inevitable system delays of practical deployment, with two design variables: future visibility $t_f$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_f,k)$ and show two consistent trade-offs: increasing effective $t_f$ improves coherence by reducing the recency gap, but requires faster inference to stay within the latency budget; increasing $k$ improves throughput but degrades accompaniment quality due to a reduced update rate. Finally, we observe that naive maximum-likelihood streaming training is insufficient for coherent accompaniment when future context is unavailable, motivating more advanced anticipatory and agentic objectives for live jamming.
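The latency-budget arithmetic behind the $(t_f, k)$ trade-off can be sketched with a few lines of code. This is a hypothetical illustration only: the frame rate, function names, and the specific definition of the recency gap are assumptions for exposition, not the paper's implementation.

```python
FRAME_RATE_HZ = 50.0  # assumed audio frame/token rate (illustrative)

def is_realtime_feasible(k: int, inference_time_per_call_s: float,
                         frame_rate_hz: float = FRAME_RATE_HZ) -> bool:
    """Each call emits k frames covering k / frame_rate seconds of playback.
    To avoid audio underruns, inference for those k frames must complete
    before playback catches up -- this is the per-call latency budget."""
    return inference_time_per_call_s < k / frame_rate_hz

def recency_gap_s(t_f_s: float, inference_time_per_call_s: float) -> float:
    """Illustrative recency gap: how stale the conditioning input is at
    playback time. Larger effective future visibility t_f shrinks the gap;
    slower inference widens it (clamped at zero)."""
    return max(0.0, inference_time_per_call_s - t_f_s)

# Larger k buys a bigger budget per call (better throughput), but the
# abstract notes this comes at the cost of a reduced update rate.
print(is_realtime_feasible(k=10, inference_time_per_call_s=0.15))  # budget 0.2 s
print(is_realtime_feasible(k=5, inference_time_per_call_s=0.15))   # budget 0.1 s
```

Under these assumptions, the two trade-offs in the abstract fall out directly: shrinking `k` tightens the per-call budget, and shrinking `t_f` widens the recency gap unless inference gets faster.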