🤖 AI Summary
This study investigates whether audio large language models (Audio LLMs) genuinely use the audio modality in music dialogues, rather than relying solely on textual reasoning. To this end, we extend the Shapley-value-based MM-SHAP framework to audio-text multimodal settings for the first time and apply it on the MuChoMusic benchmark to quantify each modality's contribution to model outputs. Results show that the overall audio contribution is low and that high-accuracy models depend predominantly on text, yet the audio modality is still used to localize key acoustic events (e.g., onsets, rhythmic transitions), indicating that it plays a functional role. This work introduces the first modality-attribution interpretability method tailored to Audio LLMs, exposing an implicit text bias in current evaluation protocols and providing both theoretical tools and empirical evidence for building trustworthy multimodal music understanding systems.
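For reference, MM-SHAP's per-sample modality scores are built from absolute Shapley values over input tokens. The audio-text version below is a direct transfer of the original vision-language definition, so the notation is ours rather than a quote from the paper:

$$
\Phi_A = \sum_{j \in \text{audio}} |\phi_j|, \qquad
\Phi_T = \sum_{j \in \text{text}} |\phi_j|, \qquad
\text{A-SHAP} = \frac{\Phi_A}{\Phi_A + \Phi_T}, \qquad
\text{T-SHAP} = \frac{\Phi_T}{\Phi_A + \Phi_T},
$$

where $\phi_j$ is the Shapley value of input text token or audio patch $j$ for the model's prediction on a given sample. The scores are performance-agnostic: they describe how much each modality drives the prediction, not whether the prediction is correct.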
📝 Abstract
Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear whether they are truly listening to the audio or, as recent benchmarks suggest, relying mainly on textual reasoning. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that measures the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions. However, further inspection shows that even when the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs, and we hope it will serve as a foundational step for future research in explainable AI and audio.
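To make the measurement concrete, here is a minimal, self-contained sketch of how per-modality contributions can be derived from Shapley values. It uses a toy scoring function and a generic Monte-Carlo permutation estimator in place of an actual Audio LLM and the SHAP tooling used in the paper, so `model_score`, the masking scheme, and all sizes are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np

def model_score(audio_mask: np.ndarray, text_mask: np.ndarray) -> float:
    """Toy stand-in for an Audio LLM: given boolean masks over audio patches
    and text tokens (True = kept, False = masked out), return a scalar score
    for the predicted answer. A real setup would run a masked forward pass."""
    audio_w = np.linspace(0.1, 0.5, audio_mask.size)
    text_w = np.linspace(0.2, 1.0, text_mask.size)
    return float(audio_mask @ audio_w + text_mask @ text_w)

def mc_shapley(n_audio: int, n_text: int, n_perm: int = 200, seed: int = 0) -> np.ndarray:
    """Monte-Carlo permutation estimate of Shapley values over the joint set
    of audio patches (first n_audio positions) and text tokens (the rest)."""
    rng = np.random.default_rng(seed)
    n = n_audio + n_text
    phi = np.zeros(n)
    for _ in range(n_perm):
        order = rng.permutation(n)
        mask = np.zeros(n, dtype=bool)          # start with everything masked
        prev = model_score(mask[:n_audio], mask[n_audio:])
        for j in order:                         # reveal features one by one
            mask[j] = True
            cur = model_score(mask[:n_audio], mask[n_audio:])
            phi[j] += cur - prev                # marginal contribution of j
            prev = cur
    return phi / n_perm

# Aggregate absolute Shapley values per modality into A-SHAP / T-SHAP.
n_audio, n_text = 8, 12
phi = mc_shapley(n_audio, n_text)
audio_contrib = np.abs(phi[:n_audio]).sum()
text_contrib = np.abs(phi[n_audio:]).sum()
a_shap = audio_contrib / (audio_contrib + text_contrib)
print(f"A-SHAP = {a_shap:.2f}, T-SHAP = {1 - a_shap:.2f}")
```

An A-SHAP close to 1 indicates an audio-dominant prediction and a value close to 0 indicates text dominance; inspecting the individual audio-patch values (rather than only the aggregate) is what allows the localization analysis of key sound events described above.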