🤖 AI Summary
Existing deepfake detection methods predominantly formulate sequential forgery recognition as an image-to-sequence task, relying on generic Transformer architectures that lack explicit modeling of the temporal characteristics of manipulation artifacts. This work proposes TSOM, a sequence-level detection framework that redesigns the Transformer along three dimensions of manipulations: Texture, Shape, and Order. It introduces four components: (i) a texture-aware branch built on Diversiform Pixel Difference Attention to capture subtle manipulation traces; (ii) Multi-source Cross-attention that correlates spatial and sequential features; (iii) Shape-guided Gaussian mapping that supplies a spatial prior on the manipulation shape; and (iv) a reversed (backward) prediction order, motivated by the observation that later manipulations can disturb the traces of earlier ones. Evaluated on multiple benchmarks, the proposed method significantly outperforms state-of-the-art approaches in both sequential manipulation recognition accuracy and robustness.
📝 Abstract
Sequential DeepFake detection is an emerging task that predicts the manipulation sequence in order. Existing methods typically formulate it as an image-to-sequence problem and employ conventional Transformer architectures; lacking dedicated designs, they achieve limited performance. This paper therefore describes a new Transformer design, called TSOM, explored from three perspectives: the Texture, Shape, and Order of Manipulations. Our method features four major improvements. First, we describe a new texture-aware branch that effectively captures subtle manipulation traces with a Diversiform Pixel Difference Attention module. Second, we introduce a Multi-source Cross-attention module that seeks deep correlations between spatial and sequential features, enabling effective modeling of complex manipulation traces. Third, to further enhance the cross-attention, we describe a Shape-guided Gaussian mapping strategy that provides an initial prior on the manipulation shape. Finally, observing that a subsequent manipulation in a sequence may disturb the traces left by the preceding one, we invert the prediction order from forward to backward, which yields notable gains. Extensive experimental results demonstrate that our method outperforms existing approaches by a large margin.
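The intuition behind pixel-difference attention can be illustrated with a minimal, pure-Python sketch (not the paper's actual module): each position is described by its contrast with neighboring pixels rather than its raw intensity, which amplifies the subtle, high-frequency traces that manipulations leave behind. The function name and the central-difference variant shown here are illustrative assumptions.

```python
def central_pixel_difference(img):
    """Replace each pixel with the mean difference to its 8 neighbours.

    img: 2D list of floats. Flat regions map to ~0, while edges and
    texture irregularities (e.g. blending boundaries) produce large
    responses -- the kind of signal a pixel-difference attention
    branch would feed into its queries/keys.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        total += img[ny][nx] - img[y][x]
                        count += 1
            out[y][x] = total / count
    return out
```

On a perfectly uniform patch every response is zero; an isolated bright pixel produces a strong negative response at its own location, so local texture anomalies stand out regardless of absolute brightness.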
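The Shape-guided Gaussian mapping can likewise be sketched in a few lines. The idea, under our reading of the abstract, is to bias cross-attention toward the region where the manipulation is expected to lie; the exact parameterization below (an isotropic Gaussian centred on a predicted manipulation centre, multiplied into raw attention scores) is an assumption for illustration only.

```python
import math

def gaussian_prior(h, w, mu_y, mu_x, sigma):
    """Spatial prior peaked at the predicted manipulation centre (mu_y, mu_x)."""
    return [[math.exp(-((y - mu_y) ** 2 + (x - mu_x) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def bias_attention_scores(scores, prior):
    """Modulate raw attention scores element-wise by the shape prior,
    so positions far from the expected manipulation area are down-weighted."""
    return [[s * p for s, p in zip(row_s, row_p)]
            for row_s, row_p in zip(scores, prior)]
```

With uniform scores, the biased map simply reproduces the prior: attention mass concentrates around the predicted shape centre and decays with distance, which is the "initial prior" role the abstract describes.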