🤖 AI Summary
This work addresses the poor interpretability of large audio-language models (AudioLLMs), whose individual neurons often activate for multiple unrelated concepts. To resolve this, the authors propose the first mechanistic interpretability framework tailored to AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic neural activations into monosemantic, human-interpretable features. By combining representative audio clips, automated feature naming, and human validation, the method constructs a semantically coherent concept ontology. This enables systematic disentanglement and semantic annotation of internal representations in AudioLLMs, substantially improving model transparency and controllability, and further supports efficient concept-based retrieval, intervention, and manipulation, laying a foundation for trustworthy deployment in high-stakes applications.
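The core mechanism described above can be sketched in a few lines. The snippet below is a minimal, illustrative sparse autoencoder of the general kind used for this style of interpretability work: hidden activations are projected into an overcomplete feature space through a ReLU encoder, and an L1 penalty encourages each feature to fire sparsely, so individual features tend to align with single concepts. All dimensions, the L1 coefficient, and the random data are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overcomplete dictionary: more learned features than model dimensions,
# so polysemantic activations can be split across monosemantic features.
d_model, d_features = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder yields sparse, non-negative feature activations
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear decoder reconstructs the original activation vector
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)          # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f))   # L1 pushes features toward zero
    return recon + sparsity, f

# Stand-in batch of AudioLLM hidden activations (real inputs would be
# activations collected from the model on audio clips).
x = rng.normal(size=(8, d_model))
loss, feats = sae_loss(x)
print(loss, feats.shape)
```

In the full pipeline, the active features `feats` would then be matched to the audio clips that most strongly activate them, named via automated captioning, and validated by humans; the same feature directions also give the handles used for steering.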
📝 Abstract
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/