Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the twin challenges of expensive frame-level annotations and the difficulty of capturing smooth transitional artifacts in deepfake speech detection, this paper proposes the Temporal Difference Attention Module (TDAM), which operates without boundary supervision. TDAM introduces a frame-level temporal difference analysis perspective, modeling the intrinsic differences in dynamic evolution between genuine and spoofed speech via a dual-level differential representation: fine-grained inter-frame differences and coarse-grained inter-segment differences. Combined with temporal difference attention and adaptive average pooling, this enables hierarchical anomaly perception without explicit frame-level labels. Evaluated on the PartialSpoof and HAD datasets, TDAM achieves EERs of 0.59% and 0.03%, respectively, substantially outperforming state-of-the-art methods. This work advances efficient, high-accuracy deepfake speech detection with significantly reduced annotation overhead.
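The dual-level differential representation described above can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: `segment_len` and the use of segment means before differencing are assumptions; the idea is simply first-order differences of frame embeddings at a fine (per-frame) and a coarse (per-segment) scale.

```python
import numpy as np

def dual_level_differences(features, segment_len=4):
    """Hypothetical sketch of a dual-level difference representation.

    features: (T, D) array of frame-level embeddings.
    Returns fine-grained inter-frame differences and coarse-grained
    inter-segment differences (differences of segment-mean embeddings).
    """
    # Fine-grained: first-order difference between consecutive frames.
    frame_diff = np.diff(features, axis=0)  # shape (T-1, D)

    # Coarse-grained: average frames into fixed-length segments,
    # then take first-order differences between segment means.
    T, D = features.shape
    n_seg = T // segment_len
    segments = (
        features[: n_seg * segment_len]
        .reshape(n_seg, segment_len, D)
        .mean(axis=1)
    )
    seg_diff = np.diff(segments, axis=0)  # shape (n_seg-1, D)
    return frame_diff, seg_diff
```

On bonafide speech these difference trajectories should evolve smoothly; the paper's finding is that spoofed regions show erratic directional changes in exactly these quantities, which is what the attention mechanism then scores.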

📝 Abstract
Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Also, they focus on detecting transition artifacts between bonafide and deepfake segments. As deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical difference representation captures temporal irregularities at both fine and coarse scales, while adaptive average pooling preserves essential patterns across variable-length inputs to minimize information loss. Our TDAM-AvgPool model achieves state-of-the-art performance, with an EER of 0.59% on the PartialSpoof dataset and 0.03% on the HAD dataset, which significantly outperforms the existing methods without requiring frame-level supervision.
Problem

Research questions and friction points this paper is trying to address.

Detecting subtle deepfake speech without frame-level annotations
Identifying unnatural temporal variations in deepfake speech
Overcoming challenges from smoothed deepfake transition artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Difference Attention Module for detection
Dual-level hierarchical difference representation
Adaptive average pooling minimizes information loss
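The adaptive average pooling mentioned in the last bullet maps a variable-length frame sequence to a fixed number of output bins, so no frames are discarded by truncation. A minimal sketch, mirroring the behavior of PyTorch's `AdaptiveAvgPool1d` in plain NumPy (the function name and bin arithmetic are illustrative assumptions, not the authors' code):

```python
import numpy as np

def adaptive_avg_pool_1d(x, out_len):
    """Pool a variable-length (T, D) feature sequence down to
    (out_len, D) by averaging within adaptively sized bins."""
    T, D = x.shape
    out = np.empty((out_len, D))
    for i in range(out_len):
        start = (i * T) // out_len          # floor of bin start
        end = -(-((i + 1) * T) // out_len)  # ceil of bin end
        out[i] = x[start:end].mean(axis=0)  # average frames in the bin
    return out
```

Because every input frame falls into some bin, inputs of any length produce the same output shape while still contributing to the pooled representation, which is how variable-length utterances can feed a fixed-size classifier head with minimal information loss.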
Menglu Li
Toronto Metropolitan University
Audio Processing · Deep Learning
Xiao-Ping Zhang
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, and Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON, Canada
Lian Zhao
Toronto Metropolitan University
Resource Management · IoV/IoT Networks · Mobile Edge Computing