MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of end-to-end song generation: high data and computational demands, limited editability, and difficulty aligning intermittent vocals with the accompaniment in rhythm and harmony. To overcome these issues, the authors propose a staged generation framework that decouples the task into melody composition, vocal synthesis, and accompaniment generation. Central to this approach is the MIDI-SAG method, which leverages symbolic vocal-melody MIDI to explicitly guide the rhythmic and harmonic structure of the accompaniment, while an audio continuation mechanism maintains coherence during vocal pauses. Trained on only 2.5k hours of audio using a single RTX 3090 GPU, the model achieves perceptual quality comparable to current open-source end-to-end baselines, substantially reducing resource requirements while improving controllability.

📝 Abstract
Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability. We advocate a compositional alternative that decomposes the task into melody composition, singing voice synthesis, and singing accompaniment generation. Central to our approach is MIDI-informed singing accompaniment generation (MIDI-SAG), which conditions accompaniment on the symbolic vocal-melody MIDI to improve rhythmic and harmonic alignment between singing and instrumentation. Moreover, beyond conventional SAG settings that assume continuously sung vocals, compositional song generation features intermittent vocals; we address this by combining explicit rhythmic/harmonic controls with audio continuation to keep the backing track consistent across vocal and non-vocal regions. With lightweight newly trained components requiring only 2.5k hours of audio on a single RTX 3090, our pipeline approaches the perceptual quality of recent open-source end-to-end baselines in several metrics. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
Problem

Research questions and friction points this paper is trying to address.

song generation
singing accompaniment generation
MIDI-informed
compositional pipeline
intermittent vocals
Innovation

Methods, ideas, or system contributions that make the work stand out.

MIDI-informed accompaniment
compositional song generation
singing voice synthesis
rhythmic-harmonic alignment
audio continuation
Fang-Duo Tsai
National Taiwan University
Music AI
Yi-An Lai
National Taiwan University
Fei-Yueh Chen
University of Rochester
Hsueh-Wei Fu
National Taiwan University
Li Chai
Independent researcher
Wei-Jaw Lee
National Taiwan University; Taiwan AI Labs
Hao-Chung Cheng
National Taiwan University
Quantum Information Theory, Quantum Machine Learning, Matrix Analysis, Statistical Inference
Yi-Hsuan Yang
National Taiwan University
Music information retrieval, Music Generation, Music Processing, Music AI, Affective computing