MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

📅 2026-05-27
🤖 AI Summary
Current large audio language models struggle to accurately localize temporal events in music and lack fine-grained alignment capabilities over time intervals. To address this gap, this work formally defines and systematically evaluates the temporal localization problem for music foundation models, introducing MusTBENCH, a novel benchmark validated by domain experts that encompasses five types of temporal question-answering tasks. The authors further propose MusT, a four-stage optimization framework integrating music encoder adaptation, large language model alignment, supervised fine-tuning, and reinforcement learning. Experimental results reveal that existing models perform poorly on temporal localization, whereas MusT substantially outperforms strong baselines, establishing precise temporal grounding as a critical capability gap and a key direction for advancing music understanding in large audio language models.
📝 Abstract
Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.
Problem

Research questions and friction points this paper is trying to address.

temporal grounding
music understanding
Large Audio-Language Models
temporal localization
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal grounding
music understanding
Large Audio-Language Models
benchmark
reinforcement learning