🤖 AI Summary
Existing generative models primarily target de novo composition and integrate poorly into iterative human music-creation workflows. This paper proposes an interactive, conditional single-stem accompaniment generation method that produces beat-aligned instrumental accompaniments in real time from an input mixture. Methodologically: (1) it introduces prefix-based conditional modeling, extending the Transformer's embedding matrix to incorporate context tokens; (2) it achieves precise beat alignment without any auxiliary tempo module; and (3) it builds on MusicGen's fine-tuning architecture, combining instruction-based fine-tuning, hybrid audio tokenization, and rhythm-guided metronome-track conditioning. Experiments demonstrate significant improvements in mixture coherence, audio fidelity, text-prompt alignment, and beat-structure accuracy, achieving state-of-the-art performance. The system is designed for ready adoption in real-world music-production workflows.
📝 Abstract
Recent advances in generative models have made it possible to create high-quality, coherent music, with some systems delivering production-level output. Yet most existing models focus solely on generating music from scratch, limiting their usefulness for musicians who want to integrate such models into a human, iterative composition workflow. In this paper we introduce STAGE, our STemmed Accompaniment GEneration model, fine-tuned from the state-of-the-art MusicGen to generate single-stem instrumental accompaniments conditioned on a given mixture. Inspired by instruction-tuning methods for language models, we extend the transformer's embedding matrix with a context token, enabling the model to attend to a musical context through prefix-based conditioning. Compared to the baselines, STAGE yields accompaniments that exhibit stronger coherence with the input mixture, higher audio quality, and closer alignment with textual prompts. Moreover, by conditioning on a metronome-like track, our framework naturally supports tempo-constrained generation, achieving state-of-the-art alignment with the target rhythmic structure, all without requiring any additional tempo-specific module. As a result, STAGE offers a practical, versatile tool for interactive music creation that can be readily adopted by musicians in real-world workflows.
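The core idea of prefix-based conditioning (extending the embedding matrix with a context token and letting the model attend to a prepended musical context) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the vocabulary size, model dimension, and function names are all hypothetical, and a NumPy lookup table stands in for the transformer's learned embedding layer.

```python
import numpy as np

# Illustrative sizes only; the actual MusicGen codebook and model
# dimensions differ.
VOCAB_SIZE = 2048
D_MODEL = 64
rng = np.random.default_rng(0)

# Extended embedding matrix: one extra row for a new <context> token,
# appended after the original vocabulary.
embed = rng.standard_normal((VOCAB_SIZE + 1, D_MODEL))
CONTEXT_TOKEN = VOCAB_SIZE  # id of the newly added context marker

def embed_with_prefix(context_ids, target_ids):
    """Prepend the <context> marker and the context (mixture) tokens to the
    target sequence, then look up embeddings for the full sequence.
    The transformer can then attend to the context as an ordinary prefix."""
    seq = np.concatenate([[CONTEXT_TOKEN], context_ids, target_ids])
    return embed[seq]  # shape: (1 + len(context) + len(target), D_MODEL)

# Toy token sequences standing in for audio-codec tokens of the input
# mixture and the accompaniment being generated.
ctx = rng.integers(0, VOCAB_SIZE, size=5)
tgt = rng.integers(0, VOCAB_SIZE, size=7)
x = embed_with_prefix(ctx, tgt)
print(x.shape)  # (13, 64)
```

Because the conditioning signal is just another token prefix, the same mechanism accommodates a metronome-like track: its tokens are simply included in the context, with no tempo-specific module required.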