🤖 AI Summary
This work addresses the poor interpretability of large audio-language models (AudioLLMs), whose individual neurons often activate for multiple unrelated concepts. To resolve this, the authors propose the first mechanistic interpretability framework tailored to AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic neural activations into monosemantic, human-interpretable features. By combining representative audio clips, automated feature naming, and human validation, the method constructs a semantically coherent concept ontology. This enables systematic disentanglement and semantic annotation of internal representations in AudioLLMs, substantially improving model transparency and controllability, and further supports efficient concept-based retrieval, intervention, and manipulation, laying a foundation for trustworthy deployment in high-stakes applications.
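The core mechanism described above can be sketched in a few lines. The snippet below is a minimal, illustrative sparse autoencoder of the general kind used for this style of interpretability work: hidden activations are projected into an overcomplete feature space through a ReLU encoder, and an L1 penalty encourages each feature to fire sparsely, so individual features tend to align with single concepts. All dimensions, the L1 coefficient, and the random data are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overcomplete dictionary: more learned features than model dimensions,
# so polysemantic activations can be split across monosemantic features.
d_model, d_features = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder yields sparse, non-negative feature activations
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear decoder reconstructs the original activation vector
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)          # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f))   # L1 pushes features toward zero
    return recon + sparsity, f

# Stand-in batch of AudioLLM hidden activations (real inputs would be
# activations collected from the model on audio clips).
x = rng.normal(size=(8, d_model))
loss, feats = sae_loss(x)
print(loss, feats.shape)
```

In the full pipeline, the active features `feats` would then be matched to the audio clips that most strongly activate them, named via automated captioning, and validated by humans; the same feature directions also give the handles used for steering.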
📝 Abstract
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/