TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Most state-of-the-art detectors of AI-generated images build on the original CLIP-ViT, leaving the potential of newer vision foundation models (VFMs) largely unexplored. This work benchmarks multiple VFM families on their out-of-the-box ability to detect both fully generated and AI-inpainted images, and finds that the best model outperforms the original CLIP by more than 12% in accuracy. To better aggregate the output tokens of a modern VFM, the study redesigns the classifier head around tunable attention pooling (TAP). Combining TAP with the latest VFMs yields substantial gains across several detection benchmarks and sets a new state of the art on two challenging benchmarks for in-the-wild detection of AI-generated and AI-inpainted images.
📝 Abstract
Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
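The abstract describes TAP only as a classifier head that aggregates output tokens into a refined global representation. The following is a minimal PyTorch sketch of one common way to build such a head, assuming a single learnable query that cross-attends over the output tokens of a frozen backbone. The class name `AttentionPoolHead`, the use of `nn.MultiheadAttention`, the head count, and the token dimensions are illustrative assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn


class AttentionPoolHead(nn.Module):
    """Hypothetical attention-pooling head over VFM output tokens (illustrative only)."""

    def __init__(self, dim: int, num_heads: int = 4, num_classes: int = 1):
        super().__init__()
        # Assumption: one learnable query vector attends over all output tokens.
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        nn.init.trunc_normal_(self.query, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. patch tokens from a frozen backbone.
        q = self.query.expand(tokens.size(0), -1, -1)
        # Cross-attention: the query pools the tokens into one vector per image.
        pooled, _ = self.attn(q, tokens, tokens, need_weights=False)
        # Normalize the pooled vector and map it to real-vs-generated logits.
        return self.fc(self.norm(pooled.squeeze(1)))


# Usage with random stand-in features (197 tokens of width 768 is an assumed shape).
torch.manual_seed(0)
tokens = torch.randn(2, 197, 768)
head = AttentionPoolHead(dim=768)
logits = head(tokens)
print(logits.shape)  # torch.Size([2, 1])
```

In this sketch only the head's parameters would be trained while the backbone stays frozen, and the attention weights, unlike a plain mean over patch tokens, let the head emphasize the tokens that are most informative for the decision.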
Problem

Research questions and friction points this paper is trying to address.

AI-generated image detection
vision foundation models
AI image forensics
CLIP
out-of-the-box performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Foundation Models
AI-Generated Image Detection
Tunable Attention Pooling
Patch Tokens
Out-of-the-box Generalization