What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio deepfake detection models lack interpretable temporal decision-making, particularly regarding fine-grained time-domain units (e.g., speech segments, phonemes, glottal closure instants). Method: We propose a novel explainable AI (XAI) method based on relevance propagation to systematically analyze attention mechanisms in Transformer-based detectors across these temporal units. Contribution/Results: This work presents the first large-scale quantitative evaluation of temporal faithfulness for XAI methods in audio deepfake detection, revealing that conclusions drawn from small-sample studies lack generalizability. Building upon this insight, we design a more robust relevance-driven explanation framework. Experiments demonstrate that our method significantly outperforms Grad-CAM and SHAP across multiple faithfulness metrics. Moreover, it uncovers that prior small-sample studies overestimate the influence of speech onsets/offsets and non-speech segments on detection decisions. Our approach establishes a verifiable, generalizable explanatory paradigm for trustworthy audio deepfake detection.

📝 Abstract
Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that XAI results obtained from analyzing limited utterances do not necessarily hold when evaluated on large datasets.
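The abstract does not specify which faithfulness metrics are used; a common deletion-style check for time-domain explanations can be sketched as follows. This is an illustrative sketch only: the function names and the toy model below are assumptions, not taken from the paper.

```python
import numpy as np

def deletion_faithfulness(predict, x, relevance, steps=10):
    """Deletion-style faithfulness sketch: progressively zero out the most
    relevant samples and track the drop in the model's score.
    A faithful explanation should produce a steep score decline.

    predict   -- callable mapping a 1-D signal to a scalar score (toy stand-in
                 for a detector's spoof probability)
    x         -- 1-D numpy array (audio samples or frame-level features)
    relevance -- per-sample relevance scores, same shape as x
    """
    order = np.argsort(relevance)[::-1]   # most relevant samples first
    scores = [predict(x)]
    x_del = x.copy()
    chunk = max(1, len(x) // steps)
    for i in range(steps):
        x_del[order[i * chunk:(i + 1) * chunk]] = 0.0
        scores.append(predict(x_del))
    return np.array(scores)

# Toy model: score is the mean absolute amplitude; relevance is |x|,
# so deletion removes the highest-amplitude samples first.
x = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2])
curve = deletion_faithfulness(lambda s: float(np.mean(np.abs(s))), x, np.abs(x), steps=3)
# The score curve should be non-increasing as relevant samples are removed.
```

In practice the same loop is run once per XAI method (relevance propagation, Grad-CAM, SHAP), and the area under the deletion curve is compared: a smaller area indicates a more faithful explanation.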
Problem

Research questions and friction points this paper is trying to address.

Deepfake Detection
Reliability
Transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable AI (XAI)
Deep Audio Forgery Detection
Enhanced Interpretability