🤖 AI Summary
This work addresses the challenge of mapping free-form text to structured musical notation in text-to-MIDI generation. We propose a large language model (LLM)-based approach that natively supports MIDI event tokens by extending the LLM vocabulary, and introduces a two-stage training strategy: (1) music modeling pretraining on large-scale monomodal MIDI corpora, followed by (2) cross-modal fine-tuning on aligned text–MIDI pairs. The method preserves the original LLM architecture and is fully compatible with the vLLM inference engine. Experiments demonstrate significant improvements in generation quality, textual controllability, and inference latency over prior text-to-MIDI models. Specifically, our approach achieves superior multi-track richness, semantic fidelity between input text and output music, and real-time interactivity. This establishes a new paradigm for controllable, efficient, and high-fidelity AI music generation.
📄 Abstract
We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow the model with text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.
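The vocabulary-expansion idea can be sketched as follows. This is a minimal illustration, not MIDI-LLM's actual token scheme: the event names, the time-shift grid, and the base vocabulary size of 32000 are all assumptions made for the example. The key point it shows is that MIDI tokens are appended after the existing text vocabulary, so every original token keeps its id and the LLM's parameter structure is preserved.

```python
# Hypothetical sketch of extending an LLM vocabulary with MIDI event tokens.
# Event names, the time-shift grid, and the base vocab size are illustrative.

def build_midi_tokens():
    """Enumerate a simple MIDI event vocabulary (assumed scheme)."""
    tokens = []
    for pitch in range(128):                 # MIDI pitches 0-127
        tokens.append(f"<note_on_{pitch}>")
        tokens.append(f"<note_off_{pitch}>")
    for step in range(1, 33):                # coarse time-shift grid (assumed)
        tokens.append(f"<time_shift_{step}>")
    return tokens

def extend_vocab(base_vocab_size, midi_tokens):
    # New tokens receive ids appended after the existing text vocabulary,
    # leaving the original token-id mapping untouched.
    return {tok: base_vocab_size + i for i, tok in enumerate(midi_tokens)}

midi_tokens = build_midi_tokens()
vocab_ext = extend_vocab(32000, midi_tokens)  # 32000: e.g. a Llama-style base vocab
```

In practice, with a Hugging Face-style stack, the same effect is achieved by adding the new tokens to the tokenizer and resizing the model's embedding matrix, which only appends rows and so remains compatible with inference engines such as vLLM that load the standard architecture.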