🤖 AI Summary
Existing DeepFake datasets exhibit limited forgery diversity, typically covering only a single manipulation type, which hinders the development of generalizable detection models. Method: We introduce Celeb-DF++, a large-scale video benchmark encompassing three fundamental forgery categories: face swapping, facial reenactment, and talking-face generation. It integrates 22 state-of-the-art generative methods, systematically unifying multiple generation mechanisms and fine-grained facial region modeling to achieve high fidelity and elevated detection difficulty. Contribution/Results: We propose an evaluation protocol featuring fine-grained scenario classification and cross-method generalization assessment, revealing the robustness bottlenecks of mainstream detectors under cross-forgery-type settings. Extensive experiments on 24 SOTA detectors empirically confirm their generalization limitations. Celeb-DF++ establishes a new, reproducible benchmark and evaluation framework for universal DeepFake detection.
📝 Abstract
The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using 22 recent DeepFake methods in total. These methods differ in their architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detectors and the difficulty of our new dataset.
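To make the cross-forgery-type evaluation idea concrete, the sketch below shows one plausible form such a protocol could take: a detector is fitted on fakes from one scenario (FS, FR, or TF) and scored on the others, with per-pair AUC as the generalization metric. All names here (`detector`, `fit`, `score`, `cross_type_eval`) are hypothetical illustrations, not the paper's actual code or API.

```python
# Hypothetical sketch of a cross-forgery-type generalization protocol.
# The detector interface (fit/score) and function names are illustrative
# assumptions, not taken from the Celeb-DF++ release.

SCENARIOS = ["FS", "FR", "TF"]  # Face-swap, Face-reenactment, Talking-face


def auc(real_scores, fake_scores):
    """Rank-based AUC: probability that a random fake video scores
    higher than a random real video (ties count as 0.5)."""
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fake_scores
        for r in real_scores
    )
    return wins / (len(real_scores) * len(fake_scores))


def cross_type_eval(detector, fakes_by_type, real_videos):
    """Train on fakes from one scenario, test on each unseen scenario.

    Returns a dict mapping (train_type, test_type) -> AUC, so a low
    off-diagonal value exposes a cross-forgery-type robustness gap.
    """
    results = {}
    for train_type in SCENARIOS:
        model = detector.fit(fakes_by_type[train_type], real_videos)
        for test_type in SCENARIOS:
            if test_type == train_type:
                continue  # only unseen forgery types measure generalization
            fake_scores = [model.score(v) for v in fakes_by_type[test_type]]
            real_scores = [model.score(v) for v in real_videos]
            results[(train_type, test_type)] = auc(real_scores, fake_scores)
    return results
```

A perfect detector would yield AUC 1.0 on every unseen pair; the abstract's point is that existing detectors fall well short of this once the test forgery type differs from the training one.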