Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of gradient-based attribution methods for temporal sound event detection in audio classification. It assesses whether Integrated Gradients (IG), applied to a classifier trained without temporal supervision, can localize sound events in time. The evaluation uses synthetically generated polyphonic audio clips with ground-truth timestamps, and performance is measured with Intersection over Union (IoU), frame-level F1 score, and Pointing Game (PG) accuracy. On a 10-class domestic sound dataset, IG achieves a mean IoU of 0.39, a frame-level F1 of 0.52, and 82.6% PG accuracy. This substantially outperforms random and energy-based baselines and approaches framewise CNNs that produce frame-level predictions directly: 0.42 IoU, 0.55 F1, and 97.3% PG with weak (clip-level) supervision, and 0.45 IoU, 0.58 F1, and 97.9% PG with strong (frame-level) supervision, although a gap in PG accuracy remains. These results suggest that post-hoc attribution can recover meaningful temporal activity patterns from weakly supervised audio classifiers.
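To make the idea in the summary concrete, the following is a minimal sketch of how Integrated Gradients could yield a per-frame attribution curve. It is not the paper's implementation: the toy sigmoid "classifier", the all-zero baseline, the midpoint Riemann approximation, and the aggregation over frequency by summed absolute value are all assumptions chosen for illustration.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=256):
    """Midpoint Riemann approximation of IG along the straight path
    from `baseline` to `x`. `grad_fn` returns d(score)/d(input)."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Hypothetical stand-in for a trained classifier: sigmoid of a weighted sum
# over a (frequency bins, time frames) input. Not the model from the paper.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 20))

def score(x):
    return 1.0 / (1.0 + np.exp(-np.sum(W * x)))

def score_grad(x):
    s = score(x)
    return s * (1.0 - s) * W  # analytic gradient of the sigmoid score

# Toy input: a single "event" occupying frames 5..11.
x = np.zeros((8, 20))
x[:, 5:12] = 1.0
baseline = np.zeros_like(x)  # assumed all-zero (silence) baseline

attr = integrated_gradients(score_grad, x, baseline)
temporal = np.abs(attr).sum(axis=0)  # assumed aggregation: one value per frame
```

In this sketch the attributions sum (approximately) to `score(x) - score(baseline)`, which is the completeness property of IG, and `temporal` is nonzero only in the frames where the input differs from the baseline.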

📝 Abstract

Gradient-based attribution methods can highlight input regions important for neural network predictions, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can temporally detect sound events when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground-truth timestamps to measure alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves mean Intersection over Union (IoU) of 0.39, frame-level F1 of 0.52, and Pointing Game (PG) accuracy of 82.6%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9% PG. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with localization performance approaching models that explicitly produce frame-level predictions. All methods substantially outperform random and energy-based baselines.
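The three metrics named in the abstract can be sketched for a single event as follows. This is an illustrative reading rather than the paper's evaluation protocol: the relative threshold of 0.5 used to binarize the attribution curve, the function names, and the example values are all assumptions.

```python
import numpy as np

def binarize(attr, rel_threshold=0.5):
    """Mark frames active when attribution >= rel_threshold * max (assumed rule)."""
    attr = np.asarray(attr, dtype=float)
    peak = attr.max()
    if peak <= 0:
        return np.zeros(attr.shape, dtype=bool)
    return attr >= rel_threshold * peak

def iou(pred, truth):
    """Intersection over Union of two boolean frame masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0

def frame_f1(pred, truth):
    """Frame-level F1: each frame is one binary decision."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def pointing_game_hit(attr, truth):
    """Hit if the single highest-attribution frame lies inside the true event."""
    return bool(np.asarray(truth, bool)[int(np.argmax(attr))])

# Hypothetical per-frame attribution and ground truth (event in frames 3..7).
attr = np.array([0.0, 0.1, 0.2, 0.9, 1.0, 0.8, 0.6, 0.1, 0.0, 0.0])
truth = np.zeros(10, dtype=bool)
truth[3:8] = True

pred = binarize(attr)
print(iou(pred, truth), frame_f1(pred, truth), pointing_game_hit(attr, truth))
```

For this toy input the prediction covers four of the five true frames with no false positives, giving an IoU of 0.8, an F1 of 8/9, and a Pointing Game hit. Pointing Game accuracy over a dataset would then be the fraction of events scored as hits.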

Problem

Research questions and friction points this paper is trying to address.

temporal sound event detection

integrated gradients

audio classification

attribution methods

weak supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated Gradients

Temporal Sound Event Detection

Post-hoc Attribution