🤖 AI Summary
Existing deepfake detection models are primarily trained on single-step forgeries (e.g., Face-Swapping, GANs, or Diffusion models) and generalize poorly to multi-step hybrid forgeries (e.g., Face-Swapping → GAN), especially when the final operation lies out of distribution, causing severe performance degradation. To address this, we introduce FakeChain, the first large-scale benchmark for 1–3-step deepfake synthesis, encompassing diverse generator combinations and quality configurations. Through spectral analysis and cross-step detection experiments, we reveal that current detectors rely heavily on artifacts from the final operation, suffering up to a 58.83% F1 drop when it changes, while neglecting cumulative manipulation traces. Our key contributions are: (1) a photorealistic, complexity-controllable multi-step benchmark; (2) the first systematic characterization of failure mechanisms under hybrid forgeries; and (3) empirical validation that explicitly modeling the sequence of manipulations is essential for robust detection, advancing deepfake detection toward real-world complexity.
📝 Abstract
Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as Face-Swapping, GAN-based generation, and Diffusion methods, pose an emerging and unforeseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single-step manipulations, little is known about how detection models behave under such compositional, hybrid manipulation pipelines. In this work, we introduce **FakeChain**, a large-scale benchmark comprising 1-, 2-, and 3-step forgeries synthesized using five representative state-of-the-art generators. Using this benchmark, we analyze detection performance and spectral properties across hybrid manipulations at different steps, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance depends heavily on the final manipulation type, with the F1-score dropping by up to **58.83%** when it differs from the training distribution. This demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. These findings highlight the need for detection models to explicitly consider manipulation history and sequences, and underscore the importance of benchmarks such as FakeChain, which reflect the growing synthesis complexity and diversity of real-world scenarios. Our sample code is available at https://github.com/minjihh/FakeChain.