SFANet: Spatial-Frequency Attention Network for Deepfake Detection

📅 2025-10-06
🤖 AI Summary
Deepfake detection models often generalize poorly across diverse datasets and generative techniques. Method: This paper proposes a hybrid detection framework that integrates spatial-frequency attention with texture analysis. It combines facial segmentation with patch-level attention, incorporates frequency-domain processing (e.g., DCT decomposition) and local texture modeling, and employs a sequential training strategy to mitigate class imbalance. Architecturally, it unifies a Swin Transformer, a Vision Transformer (ViT), and an attention fusion network into a multi-branch feature representation. Contribution/Results: Extensive experiments show that the proposed method achieves state-of-the-art performance on the DFWild-Cup multi-source benchmark, improving detection accuracy, cross-domain robustness, and decision interpretability over existing approaches.
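The frequency-domain processing mentioned above (DCT decomposition) can be illustrated with a minimal sketch. The block-mask split and the `cutoff` parameter below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_frequency_bands(image: np.ndarray, cutoff: int = 16):
    """Split a grayscale image into low- and high-frequency parts via a 2-D DCT.
    `cutoff` (hypothetical parameter) is the side of the low-frequency
    block kept in the DCT domain."""
    coeffs = dctn(image, norm="ortho")
    low_mask = np.zeros_like(coeffs)
    low_mask[:cutoff, :cutoff] = 1.0          # keep top-left (low-frequency) block
    low = idctn(coeffs * low_mask, norm="ortho")
    high = idctn(coeffs * (1.0 - low_mask), norm="ortho")
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequency_bands(img)
# Because the DCT is linear and the two masks sum to one,
# the bands reconstruct the original image exactly.
assert np.allclose(low + high, img)
```

The high-frequency band is where blending seams and upsampling artifacts from generators tend to concentrate, which is why such splits are a common input to deepfake detectors.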

📝 Abstract
Detecting manipulated media has become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texture-based methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Detecting manipulated media across diverse datasets and generation techniques
Improving detection accuracy and robustness with a hybrid ensemble framework
Handling dataset imbalances while enhancing focus on critical facial regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines transformers and texture methods for detection
Uses frequency splitting and patch attention techniques
Implements face segmentation to enhance critical regions
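The patch-attention bullet above can be sketched as simple attention pooling over patch embeddings: each patch gets a score, the scores are softmax-normalized, and the patches are combined by their weights so salient regions (e.g., eyes, mouth) dominate. The scoring vector `w` below stands in for a learned layer and is an assumption for illustration, not the paper's architecture:

```python
import numpy as np

def patch_attention_pool(features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention-weighted pooling of patch features.
    features: (num_patches, dim) patch embeddings
    w: (dim,) scoring vector (stand-in for a learned scoring layer)
    Returns a (dim,) weighted summary of the patches."""
    scores = features @ w                            # one scalar per patch
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over patches
    return weights @ features                        # weighted combination

rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 128))   # e.g. a 7x7 grid of patch embeddings
w = rng.normal(size=128)
pooled = patch_attention_pool(patches, w)
assert pooled.shape == (128,)
```

The softmax weights also give per-patch importance scores, which is one way such models expose which facial regions drove a real/fake decision.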
Vrushank Ahire
B.Tech Undergraduate, Indian Institute of Technology Ropar
Deep Learning · Affective Computing · Machine Learning · ASR
Aniruddh Muley
Department of Mathematics and Computing, Indian Institute of Technology Ropar, Punjab, India
Shivam Zample
Department of Mathematics and Computing, Indian Institute of Technology Ropar, Punjab, India
Siddharth Verma
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India
Pranav Menon
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India
Surbhi Madan
Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India
Abhinav Dhall
Associate Professor, Monash University
Affective Computing · Computer Vision · Human-Centered AI