🤖 AI Summary
This work addresses the lack of quantitative evaluation of spectral attribution accuracy in existing explainable AI (XAI) methods for anomalous sound detection, which often rely on subjective visualizations. The authors propose the first objective evaluation framework based on band-wise masking perturbation, systematically removing frequency bands and measuring the resulting changes in model predictions to quantify the alignment between XAI attributions and the model's true spectral sensitivity. The framework enables reproducible benchmarking and is applied to evaluate four prominent XAI methods: Integrated Gradients, Occlusion, Grad-CAM, and SmoothGrad. Experimental results demonstrate that Occlusion most accurately reflects the model's spectral dependencies, whereas gradient-based methods yield unreliable attributions, thereby validating the necessity and effectiveness of the proposed framework.
📄 Abstract
Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative inspection of saliency maps, leaving open the question of whether these attributions accurately reflect the spectral cues the model uses. In this work, we introduce a new quantitative framework for evaluating XAI faithfulness in machine-sound analysis by directly linking attribution relevance to model behaviour through systematic frequency-band removal. This approach provides an objective measure of whether an XAI method for machine ASD correctly identifies the frequency regions that influence an ASD model's predictions. Using four widely adopted methods, namely Integrated Gradients, Occlusion, Grad-CAM and SmoothGrad, we show that XAI techniques differ in reliability: Occlusion demonstrates the strongest alignment with true model sensitivity, while gradient-based methods often fail to accurately capture spectral dependencies. The proposed framework offers a reproducible way to benchmark audio explanations and enables more trustworthy interpretation of spectrogram-based ASD systems.
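The band-removal evaluation described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name, the zero-masking baseline, the fixed uniform band split, and the use of Pearson correlation as the alignment score are all assumptions made for the example; `model_fn` stands in for any spectrogram-based ASD scorer and `attribution` for the saliency map produced by an XAI method.

```python
import numpy as np

def band_masking_faithfulness(model_fn, spec, attribution, n_bands=8):
    """Hypothetical sketch of the band-wise masking evaluation:
    remove each frequency band from the spectrogram, measure how much the
    model's anomaly score changes, and correlate those changes with the
    attribution mass an XAI method assigned to each band."""
    n_freq = spec.shape[0]
    edges = np.linspace(0, n_freq, n_bands + 1, dtype=int)
    base = model_fn(spec)
    deltas, relevance = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = spec.copy()
        masked[lo:hi, :] = 0.0  # remove one frequency band (zero baseline)
        deltas.append(abs(base - model_fn(masked)))       # true sensitivity
        relevance.append(np.abs(attribution[lo:hi, :]).sum())  # claimed importance
    # Alignment score: correlation between attribution mass and model sensitivity
    return float(np.corrcoef(np.asarray(deltas), np.asarray(relevance))[0, 1])
```

A faithful attribution method concentrates relevance in the bands whose removal actually perturbs the model output, yielding a correlation near 1; an unreliable one yields a low or negative score.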