MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limited instruction-following and semantic understanding capabilities of multimodal large language models for symbolic music (e.g., MIDI). To bridge this gap, we propose the first instruction-tuned multimodal large language model designed specifically for symbolic music. Our approach aligns a MusicBERT encoder with Llama-3-8B through a two-stage training strategy and introduces a fine-grained MIDI annotation scheme alongside a feature alignment mechanism. Leveraging the GiantMIDI-Piano dataset, we construct a high-quality MIDI–text paired corpus, overcoming prior reliance on audio or ABC notation. Experimental results demonstrate that our model significantly outperforms baseline methods on music description generation and question answering, with human evaluations confirming superior performance in musical understanding, emotion recognition, creativity, and overall preference.

πŸ“ Abstract
Recent advances in multimodal large language models (MLLMs) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains largely unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT with Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that enriches GiantMIDI-Piano with fine-grained metadata, yielding a MIDI-text dataset. Compared with a baseline trained under the same instruction-tuning procedure on MIDI converted to ABC notation, MIDI-LLaMA substantially outperforms it on captioning and on semantic alignment in question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.
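The two-stage recipe described above (feature alignment, then instruction tuning) is commonly realized with a small trainable projector that maps frozen MIDI-encoder features into the LLM's token-embedding space. The sketch below illustrates that pattern only; the module structure, layer count, and hidden sizes (768 for a MusicBERT-style encoder, 4096 for a Llama-3-8B-style model) are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not from the paper):
MIDI_HIDDEN = 768   # typical BERT-style encoder hidden size
LLM_HIDDEN = 4096   # typical Llama-3-8B embedding size


class MidiProjector(nn.Module):
    """Maps frozen MIDI-encoder features into the LLM embedding space.

    In a two-stage setup, stage 1 (feature alignment) would train only
    this projector on MIDI-text pairs, keeping encoder and LLM frozen;
    stage 2 (instruction tuning) would also update the LLM.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, midi_feats: torch.Tensor) -> torch.Tensor:
        # midi_feats: (batch, num_midi_tokens, in_dim) from the MIDI encoder
        return self.proj(midi_feats)


# Example: project 2 sequences of 128 MIDI-event features, producing
# soft tokens that could be prepended to the text embeddings of a prompt.
projector = MidiProjector(MIDI_HIDDEN, LLM_HIDDEN)
feats = torch.randn(2, 128, MIDI_HIDDEN)
llm_tokens = projector(feats)  # shape (2, 128, 4096)
```

Freezing everything except the projector in stage 1 keeps the alignment step cheap and prevents the LLM from drifting before instruction data is introduced.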
Problem

Research questions and friction points this paper is trying to address.

symbolic music
multimodal large language model
music understanding
MIDI
instruction-following
Innovation

Methods, ideas, or system contributions that make the work stand out.

symbolic music
instruction-following MLLM
MIDI-LLaMA
feature alignment
MIDI-text dataset
πŸ”Ž Similar Papers
No similar papers found.