🤖 AI Summary
This study addresses the lack of systematic evaluation of gradient-based attribution methods for temporal sound event localization in audio classification. It assesses whether Integrated Gradients (IG) can localize sound event boundaries when applied to a classifier trained without temporal supervision. The evaluation uses synthetic polyphonic audio with ground-truth timestamps, and performance is measured via Intersection over Union (IoU), frame-level F1 score, and Pointing Game (PG) accuracy. On a 10-class domestic sound dataset, IG achieves an IoU of 0.39, a frame-level F1 of 0.52, and 82.6% PG accuracy. This substantially outperforms random and energy-based baselines and approaches the IoU and F1 of framewise CNNs trained with clip-level labels (0.42 IoU, 0.55 F1) or frame-level labels (0.45 IoU, 0.58 F1), although those models reach higher PG accuracy (97.3% and 97.9%). These results suggest that post-hoc attribution captures meaningful temporal activity patterns and is a promising route to temporal localization in weakly supervised audio tasks.
📝 Abstract
Gradient-based attribution methods can highlight input regions important for neural network predictions, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can localize sound events in time when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground-truth timestamps to measure alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves a mean Intersection over Union (IoU) of 0.39, a frame-level F1 of 0.52, and a Pointing Game (PG) accuracy of 82.6%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9% PG. All methods substantially outperform random and energy-based baselines. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with IoU and F1 approaching those of models that explicitly produce frame-level predictions, although its PG accuracy remains about 15 points lower.
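To make the three localization metrics concrete, the following is a minimal sketch of how IoU, frame-level F1, and a Pointing Game hit could be computed for a single clip and class. It rests on assumptions that the abstract does not state: the attribution is taken to be already reduced to one score per frame, it is min-max normalized and thresholded at a hypothetical value of 0.5 to obtain a binary prediction, and a Pointing Game hit is counted when the highest-attribution frame falls inside the ground-truth event. The function names and the toy data are illustrative only and are not the paper's evaluation protocol.

```python
import numpy as np

def binarize(attr, threshold=0.5):
    """Min-max normalize a per-frame attribution curve, then threshold it.

    The 0.5 threshold is an assumption made for this sketch.
    """
    attr = np.asarray(attr, dtype=float)
    span = attr.max() - attr.min()
    if span == 0:
        # A flat curve carries no localization information.
        return np.zeros(attr.shape, dtype=bool)
    return (attr - attr.min()) / span >= threshold

def temporal_iou(pred, truth):
    """Intersection over Union of two binary frame masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, truth).sum() / union)

def frame_f1(pred, truth):
    """Frame-level F1 computed from per-frame TP, FP, and FN counts."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0
    return float(2 * tp / denom)

def pointing_hit(attr, truth):
    """True if the frame with maximal attribution lies inside the event."""
    truth = np.asarray(truth, dtype=bool)
    return bool(truth[int(np.argmax(attr))])

# Toy clip with 8 frames; the event is active in frames 2 to 5.
attr = np.array([0.0, 0.1, 0.9, 1.0, 0.6, 0.2, 0.0, 0.0])
truth = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)

pred = binarize(attr)            # frames 2, 3, 4 are predicted active
iou = temporal_iou(pred, truth)  # 3 shared frames / 4 frames in the union
f1 = frame_f1(pred, truth)       # 2*3 / (2*3 + 0 + 1)
hit = pointing_hit(attr, truth)  # argmax is frame 3, inside the event
```

Averaging `iou` and `f1` over clips and classes, and taking the fraction of clips where `hit` is true, would give dataset-level figures of the kind reported above. The choice of threshold directly affects IoU and F1 but not the Pointing Game result, which depends only on the location of the attribution peak.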