Streaming Generation for Music Accompaniment

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses real-time audio-to-accompaniment generation, i.e., low-latency, high-fidelity synthesis of coherent instrumental accompaniment (e.g., guitar) synchronized with a singer's streaming vocal input. Methodologically, it proposes a systematic streaming modeling framework, introducing a "future visibility / output chunk duration" trade-off mechanism to explicitly quantify the intrinsic tension among latency, coherence, and throughput. Streaming-trained Transformer decoders are evaluated across a grid of these settings; the experiments demonstrate the feasibility of generating high-quality accompaniment under realistic system latency constraints, while also showing that conventional maximum-likelihood streaming training alone cannot deliver coherent accompaniment when future context is unavailable. The results reveal the critical role of proactive temporal modeling for real-time music generation and point toward anticipatory objectives for interactive AI music systems.

📝 Abstract
Music generation models can produce high-fidelity coherent accompaniment given complete audio input, but are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate, in real time, a coherent accompanying stream (e.g., a guitar accompaniment). In this work, we propose a model design considering inevitable system delays in practical deployment with two design variables: future visibility $t_f$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_f,k)$ and show two consistent trade-offs: increasing effective $t_f$ improves coherence by reducing the recency gap, but requires faster inference to stay within the latency budget; increasing $k$ improves throughput but results in degraded accompaniment due to a reduced update rate. Finally, we observe that naive maximum-likelihood streaming training is insufficient for coherent accompaniment where future context is not available, motivating advanced anticipatory and agentic objectives for live jamming.
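The latency budget implied by the two design variables can be written down as simple arithmetic. The sketch below is one illustrative reading of the trade-off, not the paper's formulation; the function name, sign conventions, and example numbers are all assumptions:

```python
def chunk_compute_budget(k, frame_dur_s, t_f_s, io_delay_s):
    """Illustrative wall-clock budget for generating one output chunk.

    A chunk of k frames covers k * frame_dur_s seconds of audio and must
    be ready by its playback deadline. Pushing future visibility t_f_s
    closer to the playback time means the newest conditioning input
    arrives later, leaving less time to compute; fixed I/O delays
    (buffering, device latency) shrink the budget further.
    """
    return k * frame_dur_s - t_f_s - io_delay_s

# e.g. 8 frames at 20 ms each, 80 ms future visibility, 20 ms I/O delay
# leaves roughly 60 ms of compute time per call
```

Under this reading, the abstract's two trade-offs fall out directly: raising $t_f$ shrinks the per-call budget (inference must be faster), while raising $k$ enlarges it at the cost of a lower update rate.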
Problem

Research questions and friction points this paper is trying to address.

Real-time audio-to-audio music accompaniment generation
Managing system delays and latency trade-offs in streaming
Overcoming limitations of naive streaming training for coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer decoders for real-time audio accompaniment
Future visibility parameter to reduce recency gap
Output chunk duration balancing throughput and coherence
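A minimal streaming loop makes the roles of the two knobs concrete. Everything below is a hypothetical sketch, not the paper's implementation: `model` stands in for a streaming Transformer decoder call, and the buffering rule assumes `t_f_frames` counts input frames of lookahead relative to the output position:

```python
def stream_accompaniment(input_stream, model, k, t_f_frames):
    """Emit accompaniment in k-frame chunks as input frames arrive.

    `model(in_frames, out_frames, k)` is a placeholder for a streaming
    decoder call that returns the next k output frames conditioned on
    all input heard so far and the output history.
    """
    in_buf, out_hist = [], []
    for frame in input_stream:
        in_buf.append(frame)
        # generate the next chunk once the input covers the chunk's
        # playback span plus t_f_frames of future visibility
        while len(in_buf) >= len(out_hist) + k + t_f_frames:
            chunk = model(in_buf, out_hist, k)
            out_hist.extend(chunk)
            yield chunk
```

With a dummy model that returns silence, 20 input frames, `k = 4`, and `t_f_frames = 2`, this loop yields four 4-frame chunks and then waits for more input; shrinking `k` raises the update rate but multiplies the number of decoder calls.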