🤖 AI Summary
This study investigates whether audio large language models (Audio LLMs) genuinely use the audio modality in music dialogues, rather than relying solely on textual reasoning. To this end, we extend the Shapley-value-based MM-SHAP framework to audio-text multimodal settings for the first time and apply it on the MuChoMusic benchmark to quantify each modality's contribution to model outputs. Results show that the overall audio contribution is low and that high-accuracy models depend predominantly on text, yet the audio modality is still used to localize key acoustic events (e.g., onsets, rhythmic transitions), indicating that it plays a functional role. This work introduces the first modality-attribution interpretability method tailored to Audio LLMs, exposing an implicit text bias in current evaluation protocols and providing both theoretical tools and empirical evidence for building trustworthy multimodal music understanding systems.
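For reference, MM-SHAP's per-sample modality scores are built from absolute Shapley values over input tokens. The audio-text version below is a direct transfer of the original vision-language definition, so the notation is ours rather than a quote from the paper:

$$
\Phi_A = \sum_{j \in \text{audio}} |\phi_j|, \qquad
\Phi_T = \sum_{j \in \text{text}} |\phi_j|, \qquad
\text{A-SHAP} = \frac{\Phi_A}{\Phi_A + \Phi_T}, \qquad
\text{T-SHAP} = \frac{\Phi_T}{\Phi_A + \Phi_T},
$$

where $\phi_j$ is the Shapley value of input text token or audio patch $j$ for the model's prediction on a given sample. The scores are performance-agnostic: they describe how much each modality drives the prediction, not whether the prediction is correct.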
📝 Abstract
Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear whether they are truly listening to the audio or, as recent benchmarks suggest, relying mainly on textual reasoning. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that measures the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions. However, further inspection shows that even when the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs, and we hope it will serve as a foundational step for future research in explainable AI and audio.
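To make the measurement concrete, here is a minimal, self-contained sketch of how per-modality contributions can be derived from Shapley values. It uses a toy scoring function and a generic Monte-Carlo permutation estimator in place of an actual Audio LLM and the SHAP tooling used in the paper, so `model_score`, the masking scheme, and all sizes are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np

def model_score(audio_mask: np.ndarray, text_mask: np.ndarray) -> float:
    """Toy stand-in for an Audio LLM: given boolean masks over audio patches
    and text tokens (True = kept, False = masked out), return a scalar score
    for the predicted answer. A real setup would run a masked forward pass."""
    audio_w = np.linspace(0.1, 0.5, audio_mask.size)
    text_w = np.linspace(0.2, 1.0, text_mask.size)
    return float(audio_mask @ audio_w + text_mask @ text_w)

def mc_shapley(n_audio: int, n_text: int, n_perm: int = 200, seed: int = 0) -> np.ndarray:
    """Monte-Carlo permutation estimate of Shapley values over the joint set
    of audio patches (first n_audio positions) and text tokens (the rest)."""
    rng = np.random.default_rng(seed)
    n = n_audio + n_text
    phi = np.zeros(n)
    for _ in range(n_perm):
        order = rng.permutation(n)
        mask = np.zeros(n, dtype=bool)          # start with everything masked
        prev = model_score(mask[:n_audio], mask[n_audio:])
        for j in order:                         # reveal features one by one
            mask[j] = True
            cur = model_score(mask[:n_audio], mask[n_audio:])
            phi[j] += cur - prev                # marginal contribution of j
            prev = cur
    return phi / n_perm

# Aggregate absolute Shapley values per modality into A-SHAP / T-SHAP.
n_audio, n_text = 8, 12
phi = mc_shapley(n_audio, n_text)
audio_contrib = np.abs(phi[:n_audio]).sum()
text_contrib = np.abs(phi[n_audio:]).sum()
a_shap = audio_contrib / (audio_contrib + text_contrib)
print(f"A-SHAP = {a_shap:.2f}, T-SHAP = {1 - a_shap:.2f}")
```

An A-SHAP close to 1 indicates an audio-dominant prediction and a value close to 0 indicates text dominance; inspecting the individual audio-patch values (rather than only the aggregate) is what allows the localization analysis of key sound events described above.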