Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient text–music alignment during inference in text-to-MIDI symbolic music generation, this paper proposes a zero-cost, plug-and-play inference optimization framework. Methodologically, it is the first to jointly model text–audio semantic alignment and tonal-harmonic structural consistency within autoregressive sampling: it introduces a contrastive-learning-based text–audio alignment score, a tonality-aware harmonic consistency penalty, and implements sampling reweighting via gradient-guided optimization. Crucially, no model fine-tuning or additional training is required, enabling seamless integration with any pre-trained text-to-MIDI generator. Experiments on the Text2MIDI benchmark demonstrate significant improvements: +12.3% in MIDI BLEU, +18.7% in tonal accuracy, and +34.1% in human preference rate—substantially enhancing both semantic fidelity and musical plausibility of generated outputs.
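The sampling reweighting described above can be illustrated with a minimal best-of-k sketch: at each step, draw several candidate continuations from the frozen base model and keep the one with the highest combined alignment score. All names here (`propose`, `text_audio_score`, `harmonic_score`, the weight `alpha`, and the choice of k) are illustrative assumptions, not the paper's actual interface.

```python
def combined_score(candidate, caption, text_audio_score, harmonic_score, alpha=0.5):
    """Weighted sum of the two alignment objectives (weight alpha is an assumption)."""
    return alpha * text_audio_score(candidate, caption) + (1 - alpha) * harmonic_score(candidate)

def reweighted_step(prefix, caption, propose, text_audio_score, harmonic_score, k=8, alpha=0.5):
    """Sample k candidate continuations from the base model (via `propose`)
    and keep the one with the highest combined alignment score.
    No fine-tuning of the generator is involved."""
    candidates = [propose(prefix) for _ in range(k)]
    return max(
        candidates,
        key=lambda c: combined_score(c, caption, text_audio_score, harmonic_score, alpha),
    )
```

Because the selection only reranks samples from the pre-trained model, it plugs into any autoregressive text-to-MIDI generator without touching its weights.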

📝 Abstract
We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objective scores: a text-audio consistency score that measures rhythmic alignment between the generated music and the original text caption, and a harmonic consistency score that penalizes generated music containing notes inconsistent with the key. By optimizing these alignment-based objectives during the generation process, our model produces symbolic music that is more closely tied to the input captions, thereby improving the overall quality and coherence of the generated compositions. Our approach can extend any existing autoregressive model without requiring further training or fine-tuning. We evaluate our work on top of Text2midi, an existing text-to-MIDI generation model, demonstrating significant improvements in both objective and subjective evaluation metrics.
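The harmonic consistency score in the abstract penalizes notes outside the key. A minimal sketch of such a score, assuming a major-scale key model and pitch classes 0-11 (the paper's exact formulation is not given here), could look like:

```python
# Semitone offsets of a major scale relative to the tonic (an assumption;
# the paper's key model may be richer, e.g. covering minor keys).
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}

def harmonic_consistency(pitches, tonic):
    """Fraction of notes whose pitch class lies in the major scale of
    `tonic` (a pitch class 0-11). Higher is more in-key; 1.0 for an
    empty sequence so silence is never penalized."""
    if not pitches:
        return 1.0
    in_key = sum(1 for p in pitches if (p - tonic) % 12 in MAJOR_SCALE)
    return in_key / len(pitches)
```

Using this as a reward during sampling discourages continuations whose notes fall outside the inferred key, which is the role the harmonic consistency score plays in the abstract.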
Problem

Research questions and friction points this paper is trying to address.

Improving symbolic music generation via inference-time alignment
Enhancing music-text rhythmic and harmonic consistency
Optimizing alignment without retraining autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages text-to-audio alignment during inference
Introduces rhythmic and harmonic consistency scores
Extends autoregressive models without retraining