Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness and interpretability of existing audio deepfake detection methods, this paper proposes a detection paradigm grounded in acoustic prosody, specifically six classical prosodic dimensions: fundamental frequency (F0), jitter, shimmer, intensity, speech rate, and pause ratio, favoring high-level linguistic representations over low-level acoustic modeling. The approach integrates attention mechanisms with feature attribution analysis to make model decisions transparent, and it evaluates robustness with an adaptive $L_\infty$-bounded adversary: the prosody-based detector degrades only marginally under attack, whereas baseline models suffer up to a 99.3% relative drop in accuracy. On standard benchmarks the detector achieves 93% accuracy and a 24.7% equal error rate, and jitter, shimmer, and mean F0 emerge as the most discriminative features, demonstrating competitive performance, strong adversarial robustness, and high interpretability.
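As a rough illustration of two of the six features, "local" jitter and shimmer can be computed from per-cycle pitch periods and peak amplitudes. This is a minimal sketch under the classical definitions (mean absolute cycle-to-cycle difference normalized by the mean); the paper's exact extraction pipeline is not specified here, and the numbers below are invented toy values.

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods,
    normalized by the mean period (classical 'local' jitter)."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Same cycle-to-cycle computation applied to per-cycle peak amplitudes."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Toy example: a slightly irregular ~100 Hz voice (periods in seconds)
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amps = [0.80, 0.78, 0.82, 0.79, 0.81]
print(f"jitter  = {local_jitter(periods):.4f}")
print(f"shimmer = {local_shimmer(amps):.4f}")
```

Higher values indicate more cycle-to-cycle irregularity; the paper's finding is that synthetic speech tends to exhibit jitter/shimmer statistics that separate it from organic speech.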

📝 Abstract
Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low-level audio features or black-box model optimization, focusing on the features that humans use to recognize speech will likely be a more robust long-term approach to detection. We explore the use of prosody, the high-level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of a linguistic-features-based approach over existing models by applying an adaptive adversary using an $L_\infty$ norm attack against the detectors and by using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have the highest impact on the model's decision (jitter, shimmer, and mean fundamental frequency) and that other models are extremely susceptible to simple $L_\infty$ norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits of a prosody-feature approach to audio deepfake detection.
Problem

Research questions and friction points this paper is trying to address.

Detect audio deepfakes using prosodic features
Improve robustness against adversarial attacks
Enhance explainability in detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses prosodic features for detection
Evaluates robustness with an adaptive $L_\infty$ adversary
Enhances model explainability with attention