Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limited generalization of current AI-generated image (AIGI) detectors in real-world, complex scenarios. The authors propose a simple yet effective approach that leverages frozen foundation vision modelsβ€”such as Perception Encoder, MetaCLIP 2, and DINOv3β€”to extract features, followed by training only a linear classifier for high-performance detection. The study reveals that large-scale pretraining data inherently contains synthetic content, which enables foundation models to spontaneously develop a generalizable ability to discriminate AIGIs. Furthermore, it uncovers distinct mechanisms by which vision-language models and self-supervised models acquire this capability. The method matches specialized detectors on standard benchmarks and achieves over a 30% accuracy improvement on real-world datasets, significantly outperforming existing approaches.

πŸ“ Abstract
While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state of the art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer performance degradation under recapture and transmission, and remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
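The core recipe described above, a linear probe on frozen backbone features, can be sketched as follows. This is a minimal illustration, not the authors' code: the random feature vectors below merely stand in for embeddings that would, in practice, come from a frozen pretrained encoder such as DINOv3 or MetaCLIP 2, and all shapes and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen foundation-model features. In the paper's setup these
# would be embeddings from a frozen encoder (e.g. DINOv3); here we simulate
# two feature clusters for "real" vs. "AI-generated" images.
dim = 64
real_feats = rng.normal(loc=-0.5, scale=1.0, size=(200, dim))
fake_feats = rng.normal(loc=+0.5, scale=1.0, size=(200, dim))
X = np.vstack([real_feats, fake_feats])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = real, 1 = generated

# Linear probe: a single logistic-regression layer trained on the frozen
# features; the backbone itself receives no gradient updates.
w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(500):
    logits = X @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = probs - y                      # gradient of binary cross-entropy
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

preds = (X @ w + b) > 0
accuracy = (preds == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

The design point the paper makes is that all representational work is done by pretraining: only the `w` and `b` of this probe are learned for detection, which is what makes the approach cheap to train and, per the abstract, surprisingly robust out of distribution.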
Problem

Research questions and friction points this paper is trying to address.

AI-Generated Images
generalization
in-the-wild detection
foundation models
forensic reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models
AI-generated image detection
linear probe
in-the-wild generalization
emergent forensic capability
πŸ”Ž Similar Papers
No similar papers found.