Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a weakness of existing deepfake detectors: under real-world compound degradations such as blur and severe lossy compression, their spatial attention drifts and performance degrades significantly. The authors propose a foundation-driven forensic framework. During training, an extreme compound degradation engine destroys high-frequency artifacts, so the DINOv2-Giant backbone is optimized to extract invariant geometric and semantic priors instead. Images are then processed by three complementary streams: a global texture stream, a localized facial stream, and a hybrid semantic fusion stream that incorporates CLIP. A calibrated, discretized voting mechanism aggregates the streams' predictions and suppresses background attention drift. The method placed fourth in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR, with stable zero-shot generalization.
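The discretized voting step can be illustrated with a minimal sketch. The summary does not give the calibration procedure, thresholds, or aggregation rule, so everything here is an assumption: each stream is taken to emit one fake-probability, each stream has its own hypothetical decision threshold, and the ensemble score is the fraction of streams voting "fake". The names `discretize` and `ensemble_vote` are illustrative and are not taken from the authors' code.

```python
from typing import Sequence


def discretize(prob: float, threshold: float) -> int:
    """Turn one stream's fake-probability into a hard vote (1 = fake, 0 = real)."""
    return 1 if prob >= threshold else 0


def ensemble_vote(probs: Sequence[float], thresholds: Sequence[float]) -> float:
    """Return the fraction of streams voting 'fake'.

    probs      -- one fake-probability per stream (assumed already calibrated)
    thresholds -- one hypothetical decision threshold per stream
    """
    if len(probs) != len(thresholds):
        raise ValueError("need exactly one threshold per stream")
    if not probs:
        raise ValueError("need at least one stream")
    votes = [discretize(p, t) for p, t in zip(probs, thresholds)]
    return sum(votes) / len(votes)


# Three streams (global texture, localized facial, hybrid semantic), made-up values.
score = ensemble_vote([0.9, 0.4, 0.7], [0.5, 0.5, 0.6])
is_fake = score > 0.5  # simple majority over the three hard votes
```

Hard votes discard each stream's confidence, which is one plausible reason to discretize: a single stream that becomes overconfident on a degraded input cannot dominate the ensemble score.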
📝 Abstract
Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. By analyzing spatial attribution with Score-CAM and feature stability with cosine similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating the stream predictions via a calibrated, discretized voting mechanism, our ensemble suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
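The two diagnostics named in the abstract, cosine similarity for feature stability and entropy of the attention map, have standard definitions that can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say how the authors normalize Score-CAM maps or which features they compare, so the inputs here (plain feature vectors and a non-negative 2D heatmap) and the function names are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attention_entropy(heatmap: Sequence[Sequence[float]]) -> float:
    """Shannon entropy (in nats) of a heatmap normalized to sum to 1.

    Low entropy means attention is concentrated on few pixels;
    high entropy means it is spread across the image.
    """
    flat = [max(v, 0.0) for row in heatmap for v in row]  # clip negatives
    total = sum(flat)
    if total == 0.0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0.0)


# Hypothetical usage: compare features of a clean image and its degraded copy,
# then compare how diffuse two attention maps are.
stability = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.2])
focused = attention_entropy([[0.0, 0.0], [0.0, 1.0]])
diffuse = attention_entropy([[1.0, 1.0], [1.0, 1.0]])
```

Under these definitions, a stream whose features stay stable under degradation keeps its cosine similarity near 1, and an attention map that drifts across the background would typically show higher entropy than one concentrated on a single region.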
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
spatial attention drift
compound degradations
robustness
real-world conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial attention drift
compound degradations
complementary ensembles
geometric priors
zero-shot generalization