TinyMU: A Compact Audio-Language Model for Music Understanding

📅 2026-04-17

🤖 AI Summary
This work addresses the challenges of high training costs, slow inference, and limited deployability of large models on edge devices in music understanding tasks. The authors propose TinyMU, a lightweight music language model with only 229 million parameters, which integrates the MATPAC++ self-supervised audio encoder and a lightweight linear projection alignment module. They also introduce MusicSkills-3.5M, a multi-format music question-answering dataset comprising 3.5 million samples. Through supervised training on diverse types of music-related questions, TinyMU achieves 82% of the performance of current state-of-the-art large models on the MuChoMusic benchmark while reducing model size by a factor of 35, demonstrating the efficacy and potential of compact models in resource-constrained scenarios.

📝 Abstract
Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M-parameter) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder, for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation, we show that TinyMU performs strongly in both basic music understanding and complex reasoning. Notably, on the MuChoMusic benchmark, it achieves 82% of the SOTA LALM's performance despite being 35x smaller, highlighting the potential of small MLMs under constrained computational budgets.
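
The abstract says only that a lightweight linear projector aligns audio embeddings with the language model. The sketch below illustrates that general idea in plain Python and is not the authors' implementation. The dimensions, the function names (`make_projector`, `project`, `build_input_sequence`), the random initialization, and the choice to place projected audio frames before the question embeddings are all assumptions made for illustration.

```python
import random


def make_projector(audio_dim, text_dim, seed=0):
    """Create a (text_dim x audio_dim) weight matrix and a zero bias.

    Hypothetical initialization: small uniform values scaled by audio_dim.
    """
    rng = random.Random(seed)
    scale = 1.0 / audio_dim ** 0.5
    weight = [
        [rng.uniform(-scale, scale) for _ in range(audio_dim)]
        for _ in range(text_dim)
    ]
    bias = [0.0] * text_dim
    return weight, bias


def project(frames, weight, bias):
    """Apply y = W x + b to every audio frame embedding."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_input_sequence(audio_frames, text_embeddings, weight, bias):
    """Project audio frames into the text embedding space, then
    concatenate them with the question embeddings (assumed ordering)."""
    return project(audio_frames, weight, bias) + text_embeddings


# Toy sizes, chosen only to keep the example readable.
AUDIO_DIM, TEXT_DIM = 8, 4
weight, bias = make_projector(AUDIO_DIM, TEXT_DIM)

# Three stand-in audio frames (an encoder such as MATPAC++ would produce these).
audio_frames = [[0.1 * (i + j) for j in range(AUDIO_DIM)] for i in range(3)]
# Two stand-in question token embeddings.
question_embeddings = [[0.0] * TEXT_DIM for _ in range(2)]

sequence = build_input_sequence(audio_frames, question_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # prints "5 4"
```

In this sketch the projector adds only `TEXT_DIM * AUDIO_DIM + TEXT_DIM` parameters, which is why a single linear layer is a natural fit for a compact model.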
Problem

Research questions and friction points this paper is trying to address.

Music Understanding
Large Audio-Language Models
Model Efficiency
Edge Deployment
Computational Constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight audio-language model
music understanding
MusicSkills-3.5M
MATPAC++
efficient multimodal alignment