Listenable Maps for Zero-Shot Audio Classifiers

📅 2024-05-27
🏛️ Neural Information Processing Systems
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the limited interpretability of zero-shot audio classifiers, this paper proposes LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which the authors describe as, to the best of their knowledge, the first decoder-based post-hoc interpretation method for zero-shot audio classifiers. LMAC-ZS explains the text-audio similarity behind a zero-shot decision through listenable maps, that is, explanations that can be played back as audio. To keep the explanations faithful, the method uses a novel loss function that maximizes faithfulness to the original similarity between a given text-and-audio pair. An extensive evaluation with the Contrastive Language-Audio Pretraining (CLAP) model shows that the interpreter remains faithful to the model's decisions in the zero-shot classification setting. Qualitative analysis shows that the method produces meaningful explanations that correlate well with different text prompts.

📝 Abstract
Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
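As a rough illustration of the zero-shot setting the abstract refers to, the sketch below picks the text prompt whose embedding is most similar to an audio embedding. It does not call CLAP: the vectors are toy stand-ins for encoder outputs, and the names `cosine` and `zero_shot_classify` are hypothetical helpers introduced here, not part of the paper or of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_emb, text_embs):
    """Return the prompt with the highest similarity, plus all scores."""
    scores = {prompt: cosine(audio_emb, emb) for prompt, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumption): in practice these would come from the
# audio and text encoders of a contrastive language-audio model.
audio_emb = [0.9, 0.1, 0.2]
text_embs = {
    "dog barking": [1.0, 0.0, 0.1],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.1, 0.2, 1.0],
}

label, scores = zero_shot_classify(audio_emb, text_embs)
print(label)
```

Because the class is determined entirely by these similarity scores, an explanation method for this setting has to account for the text prompt as well as the audio.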

Problem

Research questions and friction points this paper is trying to address.

Interpreting decisions of zero-shot audio classifiers
Ensuring faithfulness in text-audio similarity explanations
Producing meaningful explanations for classifier decisions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-based post-hoc interpretation method
Novel loss function for similarity faithfulness
Faithful zero-shot classification with CLAP
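
One way to picture "similarity faithfulness" is to compare the text-audio similarity scores of the original input with those of the explanation. The sketch below is only an illustrative check under that assumption; it is not the loss function proposed in the paper, whose exact form is not given here. The functions `similarity_gap` and `top_prompt_preserved` and the score values are hypothetical.

```python
def similarity_gap(original_sims, explained_sims):
    """Mean absolute difference between two lists of similarity scores."""
    if len(original_sims) != len(explained_sims):
        raise ValueError("score lists must have the same length")
    diffs = [abs(o - e) for o, e in zip(original_sims, explained_sims)]
    return sum(diffs) / len(diffs)

def top_prompt_preserved(original_sims, explained_sims):
    """True if the highest-scoring prompt is the same in both lists."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(original_sims) == argmax(explained_sims)

# Toy similarity scores for three prompts (assumption, not real data).
original = [0.82, 0.10, 0.31]   # full audio vs. each prompt
explained = [0.78, 0.05, 0.20]  # explanation audio vs. each prompt

gap = similarity_gap(original, explained)
same_decision = top_prompt_preserved(original, explained)
print(gap, same_decision)
```

A small gap together with an unchanged top prompt would suggest the explanation preserves the classifier's decision; a large gap would suggest it does not.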