Towards Generalizable Deepfake Image Detection with Vision Transformers

Date: 2026-04-19

AI Summary
This work addresses the limited generalization of existing deepfake detection methods in the face of rapidly evolving generative models and diverse forgery techniques. It proposes a detection system that ensembles fine-tuned vision Transformer models (DINOv2, AIMv2, and OpenCLIP ViT-L/14) trained on the large-scale in-the-wild dataset DF-Wild. Evaluated on the DF-Wild test set, the proposed approach achieves an AUC of 96.77% and an EER of 9%, outperforming the state-of-the-art detector Effort by 7.05% in AUC and 8% in EER. The method took first place in the IEEE SP Cup 2025.

๐Ÿ“ Abstract
Detecting deepfake images is challenging because modern generative models evolve quickly and existing detection methods generalize poorly. In this paper, we use an ensemble of fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP's ViT-L/14) to create a generalizable method for detecting deepfakes. We use the DF-Wild dataset, released as part of the IEEE SP Cup 2025, because it covers a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% in AUC and 8% in EER. This was the winning solution for the SP Cup, presented at ICASSP 2025.
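The abstract does not say how the individual model outputs are combined or how the metrics were computed. As a minimal sketch, assuming score-level fusion by (optionally weighted) averaging of per-model fake probabilities, the ensemble step and the two reported metrics could look like the following. The function names `ensemble_scores`, `roc_points`, and `auc_and_eer` are hypothetical, and the toy scores are illustrative only, not results from the paper.

```python
import numpy as np

def ensemble_scores(per_model_probs, weights=None):
    """Fuse per-model fake probabilities by weighted averaging.

    per_model_probs: array-like of shape (n_models, n_images).
    weights: optional per-model weights; equal weights are an assumption.
    """
    probs = np.asarray(per_model_probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ probs / weights.sum()

def roc_points(labels, scores):
    """Return (fpr, tpr) arrays for labels in {0, 1}, where 1 means fake."""
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")  # descending by score
    labels, scores = labels[order], scores[order]
    # Keep one ROC point per distinct score (last index of each tie group).
    distinct = np.r_[np.where(np.diff(scores) != 0)[0], labels.size - 1]
    tps = np.cumsum(labels)[distinct]
    fps = (distinct + 1) - tps
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    tpr = np.r_[0.0, tps / n_pos]
    fpr = np.r_[0.0, fps / n_neg]
    return fpr, tpr

def auc_and_eer(labels, scores):
    """AUC by the trapezoid rule; EER where FPR and FNR are closest."""
    fpr, tpr = roc_points(labels, scores)
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[i] + fnr[i]) / 2.0)
    return auc, eer

# Toy example: three hypothetical models scoring four images.
labels = [0, 0, 1, 1]
per_model = [
    [0.10, 0.40, 0.35, 0.80],
    [0.20, 0.30, 0.60, 0.90],
    [0.05, 0.45, 0.55, 0.70],
]
fused = ensemble_scores(per_model)
auc, eer = auc_and_eer(labels, fused)
```

Averaging is only one possible fusion rule; learned weights or a stacked classifier on top of the model outputs would fit the same interface. The EER here is taken at the discrete ROC point where the false positive and false negative rates are closest, so on small test sets it is an approximation.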
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
generalization
vision transformers
generative models
image manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformers
Deepfake Detection
Model Ensemble
Generalization
DF-Wild Dataset
Kaliki V Srinanda
Department of Electronics and Communication Engineering, National Institute of Technology Karnataka (NITK), Surathkal - 575025, India
M Manvith Prabhu
B.Tech in ECE, National Institute of Technology Karnataka (NITK), Surathkal
Deep Learning, Natural Language Processing, Computer Vision, Speech, AI
Hemanth K Mogilipalem
Department of Information Technology, National Institute of Technology Karnataka (NITK), Surathkal - 575025, India
Jayavarapu S Abhinai
Department of Electrical and Electronics Engineering, National Institute of Technology Karnataka (NITK), Surathkal - 575025, India
Vaibhav Santhosh
Department of Electrical and Electronics Engineering, National Institute of Technology Karnataka (NITK), Surathkal - 575025, India
Aryan Herur
Department of Electronics and Communication Engineering, National Institute of Technology Karnataka (NITK), Surathkal - 575025, India
Deepu Vijayasenan
Professor, NITK, Surathkal, Mangalore, India
Speech processing, Machine Learning