🤖 AI Summary
This work addresses the challenge of mapping free-form text to structured musical notation in text-to-MIDI generation. We propose a large language model (LLM)-based approach that natively supports MIDI event tokens by extending the LLM vocabulary, and introduces a two-stage training strategy: (1) music modeling pretraining on large-scale monomodal MIDI corpora, followed by (2) cross-modal fine-tuning on aligned text–MIDI pairs. The method preserves the original LLM architecture and is fully compatible with the vLLM inference engine. Experiments demonstrate significant improvements in generation quality, textual controllability, and inference latency over prior text-to-MIDI models. Specifically, our approach achieves superior multi-track richness, semantic fidelity between input text and output music, and real-time interactivity. This establishes a new paradigm for controllable, efficient, and high-fidelity AI music generation.
📄 Abstract
We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow the model with text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.
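The vocabulary-expansion idea can be sketched as follows. This is a minimal illustration, not MIDI-LLM's actual token scheme: the event names, the time-shift grid, and the base vocabulary size of 32000 are all assumptions made for the example. The key point it shows is that MIDI tokens are appended after the existing text vocabulary, so every original token keeps its id and the LLM's parameter structure is preserved.

```python
# Hypothetical sketch of extending an LLM vocabulary with MIDI event tokens.
# Event names, the time-shift grid, and the base vocab size are illustrative.

def build_midi_tokens():
    """Enumerate a simple MIDI event vocabulary (assumed scheme)."""
    tokens = []
    for pitch in range(128):                 # MIDI pitches 0-127
        tokens.append(f"<note_on_{pitch}>")
        tokens.append(f"<note_off_{pitch}>")
    for step in range(1, 33):                # coarse time-shift grid (assumed)
        tokens.append(f"<time_shift_{step}>")
    return tokens

def extend_vocab(base_vocab_size, midi_tokens):
    # New tokens receive ids appended after the existing text vocabulary,
    # leaving the original token-id mapping untouched.
    return {tok: base_vocab_size + i for i, tok in enumerate(midi_tokens)}

midi_tokens = build_midi_tokens()
vocab_ext = extend_vocab(32000, midi_tokens)  # 32000: e.g. a Llama-style base vocab
```

In practice, with a Hugging Face-style stack, the same effect is achieved by adding the new tokens to the tokenizer and resizing the model's embedding matrix, which only appends rows and so remains compatible with inference engines such as vLLM that load the standard architecture.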