🤖 AI Summary
AI-generated face detectors suffer from poor generalization and strong dependence on the specific generative models whose outputs they were trained on. Method: This paper proposes a bi-level self-supervised learning framework that requires no synthetic training data. The inner loop jointly optimizes multiple pretext tasks (EXIF tag classification, EXIF tag ranking, and manipulated-face detection), while the outer loop dynamically adjusts the task weights to improve coarse-grained detection of manipulated faces, enabling pretraining of a visual encoder on authentic face images alone. Contribution/Results: By embedding multi-task weight learning in the bi-level optimization structure, the approach achieves, for the first time, purely self-supervised AI-generated face detection. It combines linearly weighted multi-task pretraining, Gaussian mixture model (GMM)-based anomaly detection, and a lightweight two-layer perceptron classifier. The method significantly outperforms state-of-the-art approaches under both one-class and binary classification settings and generalizes well to unseen generative models.
📝 Abstract
AI-generated face detectors trained via supervised learning typically rely on synthesized images from specific generators, limiting their generalization to emerging generative techniques. To overcome this limitation, we introduce a self-supervised method based on bi-level optimization. In the inner loop, we pretrain a vision encoder only on photographic face images using a set of linearly weighted pretext tasks: classification of categorical exchangeable image file format (EXIF) tags, ranking of ordinal EXIF tags, and detection of artificial face manipulations. The outer loop then optimizes the relative weights of these pretext tasks to enhance the coarse-grained detection of manipulated faces, serving as a proxy task for identifying AI-generated faces. In doing so, it aligns self-supervised learning more closely with the ultimate goal of AI-generated face detection. Once pretrained, the encoder remains fixed, and AI-generated faces are detected either as anomalies under a Gaussian mixture model fitted to photographic face features or by a lightweight two-layer perceptron serving as a binary classifier. Extensive experiments demonstrate that our detectors significantly outperform existing approaches in both one-class and binary classification settings, exhibiting strong generalization to unseen generators.
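The bi-level structure described above can be sketched abstractly: an inner loop takes gradient steps on a linearly weighted sum of pretext losses, and an outer loop adjusts the task weights so that the inner update reduces a proxy loss (coarse-grained manipulated-face detection). The scalar quadratic stand-in losses, targets, step sizes, and the finite-difference hypergradient below are all illustrative assumptions; the paper's actual encoder, tasks, and outer-loop update are not specified here.

```python
import numpy as np

def pretext_losses(theta):
    # Stand-in quadratic losses for the three pretext tasks
    # (EXIF classification, EXIF ranking, manipulation detection).
    # A real system would evaluate the encoder on mini-batches instead.
    return np.array([
        (theta - 1.0) ** 2,
        (theta - 2.0) ** 2,
        (theta + 0.5) ** 2,
    ])

def proxy_loss(theta):
    # Outer-loop objective: coarse-grained manipulated-face detection
    # (quadratic stand-in with an assumed optimum at 1.5).
    return (theta - 1.5) ** 2

def inner_step(theta, w, lr=0.1, eps=1e-4):
    # One inner-loop gradient step on the linearly weighted pretext loss,
    # with the gradient estimated by central finite differences.
    g = (w @ pretext_losses(theta + eps) - w @ pretext_losses(theta - eps)) / (2 * eps)
    return theta - lr * g

theta = 0.0                  # encoder parameters (scalar stand-in)
w = np.ones(3) / 3           # task weights, kept on the probability simplex

for _ in range(100):
    # Outer loop: finite-difference hypergradient of the proxy loss w.r.t. w,
    # measured through one inner-loop update.
    hyper = np.zeros(3)
    for k in range(3):
        dw = np.zeros(3)
        dw[k] = 1e-3
        hyper[k] = (proxy_loss(inner_step(theta, w + dw))
                    - proxy_loss(inner_step(theta, w - dw))) / 2e-3
    w = np.clip(w - 0.5 * hyper, 1e-6, None)
    w /= w.sum()             # re-project onto the simplex
    # Inner loop: commit one update under the new weights.
    theta = inner_step(theta, w)
```

After the loop, the weights have shifted toward the pretext tasks whose gradients most reduce the proxy loss, which is the alignment effect the outer loop is meant to provide.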