AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor interpretability of large audio-language models (AudioLLMs), whose individual neurons often activate across multiple unrelated concepts. To resolve this, the authors propose the first mechanistic interpretability framework tailored to AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic neural activations into monosemantic, human-interpretable features. By integrating representative audio clips, automated feature naming, and human validation, the method constructs a semantically coherent concept ontology. This approach enables, for the first time, systematic disentanglement and semantic annotation of internal representations in AudioLLMs, substantially enhancing model transparency and controllability. It further supports efficient concept-based retrieval, intervention, and steering, laying a foundation for trustworthy deployment in high-stakes applications.

📝 Abstract
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/
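The SAE at the heart of the pipeline can be sketched in a few lines. Below is a minimal NumPy illustration, not the paper's implementation: the dimensions (`d_model`, `d_sae`), the random weights, and the L1 coefficient are all hypothetical stand-ins. The key ideas it shows are the overcomplete dictionary (`d_sae >> d_model`), the ReLU encoder that yields sparse non-negative feature activations, and the reconstruction-plus-sparsity training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_model for the AudioLLM's hidden state,
# d_sae (>> d_model) for the overcomplete SAE feature dictionary.
d_model, d_sae = 64, 256

# Randomly initialized parameters; a real SAE would be trained on
# activations collected from the AudioLLM.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: non-negative feature activations, many exactly zero.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    # Linear decoder reconstructs the hidden state from sparse features.
    return z @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Mean-squared reconstruction error plus an L1 sparsity penalty.
    z = encode(x)
    x_hat = decode(z)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(z).sum(axis=-1).mean()
    return recon + sparsity

x = rng.normal(size=(8, d_model))  # a batch of hidden activations
z = encode(x)
print(z.shape, sae_loss(x))
```

After training, each of the `d_sae` features can be named by captioning the audio clips that most strongly activate it, which is the retrieval-and-description step the title refers to.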
Problem

Research questions and friction points this paper is trying to address.

AudioLLMs
interpretability
polysemantic neurons
mechanistic interpretability
model transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic interpretability
sparse autoencoders
AudioLLMs
monosemantic features
concept steering
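The steering contribution listed above can be sketched as a single activation edit: add a scaled copy of a feature's decoder direction to a hidden state to amplify or suppress that concept. This is a generic illustration of activation steering with a hypothetical decoder matrix, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 64, 256

# Hypothetical trained SAE decoder; each row is one feature's
# direction in the AudioLLM's activation space.
W_dec = rng.normal(0, 0.1, (d_sae, d_model))

def steer(h, feature_idx, alpha):
    """Shift hidden state h along one SAE feature's decoder direction.

    alpha > 0 amplifies the concept; alpha < 0 suppresses it.
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return h + alpha * direction

h = rng.normal(size=d_model)          # one hidden state
h_steered = steer(h, feature_idx=42, alpha=5.0)
print(np.linalg.norm(h_steered - h))  # moves by |alpha| along the direction
```

Because the direction is unit-normalized, `alpha` directly controls the magnitude of the intervention, which makes steering strength comparable across features.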