🤖 AI Summary
To address the limited interpretability of zero-shot audio classifiers, this paper proposes LMAC-ZS, the first decoder-based post-hoc explanation method for zero-shot audio classification. LMAC-ZS reveals the model's decision rationale in text-audio cross-modal similarity computation through *listenable maps*, establishing a listenable explanation paradigm for the zero-shot setting. To ensure explanation fidelity, the authors introduce a novel loss function that keeps the decoder's explanations faithful to the original CLAP model's text-audio similarity scores. By combining cross-modal similarity preservation with post-hoc interpretability, LMAC-ZS produces explanations that closely align with zero-shot predictions in evaluations with the CLAP model. Qualitative analysis shows that the generated listenable maps carry clear semantic content and correlate well with different text prompts, enabling human-perceivable, attribution-based interpretation.
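To make the faithfulness idea concrete, below is a minimal, hypothetical PyTorch-style sketch of such an objective. The callables `clap_audio_encoder` and `decoder`, and the exact loss form (an MSE between similarity scores), are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def faithfulness_loss(audio, text_emb, clap_audio_encoder, decoder):
    """Illustrative faithfulness objective (a sketch, not the paper's exact loss).

    The decoder produces a saliency mask over the audio input; the loss
    encourages the masked audio to reproduce the original CLAP
    text-audio similarity, so the explanation stays faithful to the decision.
    """
    # Original audio embedding and its similarity to the text prompt
    audio_emb = clap_audio_encoder(audio)                   # (B, D)
    sim_orig = F.cosine_similarity(audio_emb, text_emb)     # (B,)

    # Decoder predicts an interpretation mask in [0, 1] over the input
    mask = torch.sigmoid(decoder(audio, text_emb))          # same shape as `audio`

    # Embed the masked ("listenable map") audio and recompute similarity
    masked_emb = clap_audio_encoder(audio * mask)           # (B, D)
    sim_masked = F.cosine_similarity(masked_emb, text_emb)  # (B,)

    # Faithfulness: masked similarity should match the original similarity
    return F.mse_loss(sim_masked, sim_orig)
```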
📝 Abstract
Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
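For context, zero-shot audio classification with a CLAP-style model is typically done by embedding class-name prompts and the audio clip into a shared space and taking a softmax over their cosine similarities. The sketch below assumes generic `embed_audio`/`embed_text` callables and a simple prompt template, not any specific CLAP API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio, class_names, embed_audio, embed_text,
                       prompt="This is a sound of {}."):
    """Generic CLAP-style zero-shot classification (illustrative sketch).

    `embed_audio` and `embed_text` are assumed callables that return
    embeddings in a shared text-audio space.
    """
    # Embed the candidate class prompts and the audio clip
    text_embs = torch.stack([embed_text(prompt.format(c)) for c in class_names])  # (C, D)
    audio_emb = embed_audio(audio)                                                # (D,)

    # Cosine similarities between audio and each prompt -> class probabilities
    text_embs = F.normalize(text_embs, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    sims = text_embs @ audio_emb                                                  # (C,)
    probs = sims.softmax(dim=-1)
    return dict(zip(class_names, probs.tolist()))
```

In this view, a post-hoc interpreter such as LMAC-ZS explains which parts of the audio drive the similarity behind the predicted class, rather than changing how the classification itself is computed.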