🤖 AI Summary
This work addresses the challenge of precisely localizing tampered regions in generative image inpainting. We propose DinoLizer, the first framework to leverage the self-supervised ViT model DINOv2 for this task, exploiting its semantically rich patch-level features. DinoLizer employs a lightweight linear classifier head and sliding-window inference to achieve fine-grained tampering localization. Crucially, we introduce a semantics-aware training strategy that explicitly distinguishes semantic modifications (e.g., object insertion/removal) from non-semantic edits (e.g., color or lighting adjustments), significantly improving detection of small tampered objects. Built on a backbone pretrained on the B-Free dataset and evaluated on multi-source inpainting benchmarks, DinoLizer achieves a 12% average IoU gain over state-of-the-art methods and remains robust to scaling, noise addition, and JPEG compression. Ablation studies validate the contribution of each component.
📝 Abstract
We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14 \times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
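To make the pipeline concrete, below is a minimal sketch (not the authors' released code) of the two core ideas: a linear head over DINOv2 patch tokens, and sliding-window aggregation of per-patch logits into a full-resolution heatmap that is thresholded into a binary mask. The backbone variant (`dinov2_vitb14`), window size, stride, single-logit head, and 0.5 threshold are illustrative assumptions; in the paper the head is trained (here it is randomly initialized), and inputs are assumed to be ImageNet-normalized with side lengths that are multiples of 14 and at least one window wide.

```python
# Sketch only: linear head on DINOv2 patch tokens + sliding-window inference.
# Assumptions (not from the paper): ViT-B/14 backbone, 518-px windows,
# half-window stride, 1-logit head, sigmoid threshold 0.5.
import torch
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

# Per-patch binary logit; in practice this head would be trained on
# inpainted/authentic patch labels (randomly initialized here).
head = torch.nn.Linear(backbone.embed_dim, 1)

@torch.no_grad()
def patch_logits(window: torch.Tensor) -> torch.Tensor:
    """window: (1, 3, H, W) with H, W multiples of 14 -> (1, 1, H/14, W/14) logits."""
    feats = backbone.forward_features(window)["x_norm_patchtokens"]  # (1, N, C)
    h, w = window.shape[-2] // 14, window.shape[-1] // 14
    return head(feats).reshape(1, 1, h, w)

@torch.no_grad()
def predict_mask(image: torch.Tensor, win: int = 518, stride: int = 259) -> torch.Tensor:
    """Aggregate per-patch logits over overlapping windows; image: (1, 3, H, W), H, W >= win."""
    _, _, H, W = image.shape
    heat = torch.zeros(1, 1, H, W)
    count = torch.zeros_like(heat)
    ys = sorted({*range(0, H - win + 1, stride), H - win})
    xs = sorted({*range(0, W - win + 1, stride), W - win})
    for y in ys:
        for x in xs:
            crop = image[..., y:y + win, x:x + win]
            logits = patch_logits(crop)
            # Upsample patch logits to pixel resolution and accumulate.
            up = F.interpolate(logits, size=(win, win), mode="bilinear", align_corners=False)
            heat[..., y:y + win, x:x + win] += up
            count[..., y:y + win, x:x + win] += 1
    heat = heat / count.clamp(min=1)          # average overlapping predictions
    return (heat.sigmoid() > 0.5).float()     # binary manipulation mask
```

Averaging logits across overlapping windows is one simple aggregation choice; the paper's post-processing of the heatmaps into refined binary masks may differ.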