MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of end-to-end song generation: high data and computational demands, limited editability, and difficulty aligning intermittent vocals with the accompaniment in rhythm and harmony. To overcome these issues, the authors propose a staged generation framework that decouples the task into melody composition, vocal synthesis, and accompaniment generation. Central to this approach is the MIDI-SAG method, which leverages symbolic vocal-melody MIDI to explicitly guide the rhythmic and harmonic structure of the accompaniment, while an audio continuation mechanism maintains coherence during vocal pauses. Trained on only 2.5k hours of audio using a single RTX 3090 GPU, the model achieves perceptual quality comparable to current open-source end-to-end baselines, substantially reducing resource requirements while improving controllability.

📝 Abstract
Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability. We advocate a compositional alternative that decomposes the task into melody composition, singing voice synthesis, and singing accompaniment generation. Central to our approach is MIDI-informed singing accompaniment generation (MIDI-SAG), which conditions accompaniment on the symbolic vocal-melody MIDI to improve rhythmic and harmonic alignment between singing and instrumentation. Moreover, beyond conventional SAG settings that assume continuously sung vocals, compositional song generation features intermittent vocals; we address this by combining explicit rhythmic/harmonic controls with audio continuation to keep the backing track consistent across vocal and non-vocal regions. With lightweight newly trained components requiring only 2.5k hours of audio on a single RTX 3090, our pipeline approaches the perceptual quality of recent open-source end-to-end baselines in several metrics. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
Problem

Research questions and friction points this paper is trying to address.

song generation
singing accompaniment generation
MIDI-informed
compositional pipeline
intermittent vocals
Innovation

Methods, ideas, or system contributions that make the work stand out.

MIDI-informed accompaniment
compositional song generation
singing voice synthesis
rhythmic-harmonic alignment
audio continuation
Fang-Duo Tsai
National Taiwan University
Music AI
Yi-An Lai
National Taiwan University
Fei-Yueh Chen
University of Rochester
Hsueh-Wei Fu
National Taiwan University
Li Chai
Independent researcher
Wei-Jaw Lee
National Taiwan University; Taiwan AI Labs
Hao-Chung Cheng
National Taiwan University
Quantum Information Theory, Quantum Machine Learning, Matrix Analysis, Statistical Inference
Yi-Hsuan Yang
National Taiwan University
Music information retrieval, Music Generation, Music Processing, Music AI, Affective computing