MQAD: A Large-Scale Question Answering Dataset for Training Music Large Language Models

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of large-scale, diverse music question-answering (QA) datasets hinders the development of music-oriented large language models (LLMs). To address this, the paper introduces MQAD, a large-scale multimodal music QA dataset covering temporally structured musical dimensions, including beat, chords, key, and section structure, built from 270,000 tracks of the Million Song Dataset with nearly 3 million questions and captions. The method integrates Music Information Retrieval (MIR) models, which extract fine-grained, time-varying musical features, with LLMs that generate natural-language QA pairs. A multimodal model combining the LLaMA2 and Whisper architectures, together with novel subjective metrics, is used to evaluate music understanding. Experiments show that a model trained on MQAD outperforms conventional music audio captioning approaches. The dataset and code are open-sourced.

📝 Abstract
Question-answering (QA) is a natural approach for humans to understand a piece of music audio. However, for machines, accessing a large-scale dataset covering diverse aspects of music is crucial, yet challenging, due to the scarcity of publicly available music data of this type. This paper introduces MQAD, a music QA dataset built on the Million Song Dataset (MSD), encompassing a rich array of musical features, including beat, chord, key, structure, instrument, and genre -- across 270,000 tracks, featuring nearly 3 million diverse questions and captions. MQAD distinguishes itself by offering detailed time-varying musical information such as chords and sections, enabling exploration into the inherent structure of music within a song. To compile MQAD, our methodology leverages specialized Music Information Retrieval (MIR) models to extract higher-level musical features and Large Language Models (LLMs) to generate natural language QA pairs. Then, we leverage a multimodal LLM that integrates the LLaMA2 and Whisper architectures, along with novel subjective metrics to assess the performance of MQAD. In experiments, our model trained on MQAD demonstrates advancements over conventional music audio captioning approaches. The dataset and code are available at https://github.com/oyzh888/MQAD.
Problem

Research questions and friction points this paper is trying to address.

Creating a large-scale music QA dataset for training LLMs
Addressing scarcity of publicly available music QA data
Enabling detailed time-varying music structure analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging MIR models for music feature extraction
Using LLMs to generate natural language QA pairs
Integrating LLaMA2 and Whisper in multimodal model
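The data-generation pipeline above can be sketched in miniature: time-stamped MIR outputs (chords, key, sections) are serialized into text and wrapped in an instruction asking an LLM to produce QA pairs. This is an illustrative sketch, not the authors' code; the feature values, function names, and prompt template are all assumptions.

```python
# Hedged sketch of the MQAD-style pipeline: serialize time-varying MIR
# features into a text prompt that an LLM could turn into QA pairs.
# The feature schema and prompt wording here are illustrative assumptions.

def serialize_features(key, chords, sections):
    """Render MIR outputs as compact, time-stamped text.

    chords:   list of (start_sec, end_sec, chord_label)
    sections: list of (start_sec, end_sec, section_label)
    """
    lines = [f"Key: {key}"]
    lines.append("Chords: " + ", ".join(
        f"{s:.1f}-{e:.1f}s {c}" for s, e, c in chords))
    lines.append("Sections: " + ", ".join(
        f"{s:.1f}-{e:.1f}s {name}" for s, e, name in sections))
    return "\n".join(lines)

def build_qa_prompt(feature_text, n_pairs=3):
    """Wrap serialized features in an instruction for an LLM."""
    return (
        f"Given this musical analysis:\n{feature_text}\n"
        f"Write {n_pairs} question-answer pairs about the song's "
        "structure, chords, and key."
    )

# Toy example with made-up analysis results.
features = serialize_features(
    key="C major",
    chords=[(0.0, 2.0, "C"), (2.0, 4.0, "G"), (4.0, 6.0, "Am")],
    sections=[(0.0, 16.0, "verse"), (16.0, 32.0, "chorus")],
)
prompt = build_qa_prompt(features)
print(prompt)
```

In the paper's actual pipeline, the serialized features would come from specialized MIR models run over MSD audio, and the prompt would be sent to an LLM whose responses form the QA pairs.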
Zhihao Ouyang (ByteDance)
Ju-Chiang Wang (ByteDance)
Daiyu Zhang (ByteDance)
Bin Chen (ByteDance)
Shangjie Li (ByteDance)
Quan Lin (ByteDance)

Topics: Music AI · Music Information Retrieval · Machine Learning