Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection

📅 2025-12-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) exhibit limited sensitivity to pixel-level artifacts and weak semantic discriminability in AI-generated image (AIGI) detection due to task-model misalignment. Method: We propose a task-model alignment principle and introduce AlignGemini—a dual-branch architecture wherein one branch leverages a VLM for semantic consistency verification, while the other integrates a pixel-level expert network for fine-grained artifact detection. We further devise an orthogonal supervision-based decoupled training paradigm, applying pure semantic and pure pixel-level supervision separately, and construct simplified, specialized datasets. Contribution/Results: Experiments demonstrate a 9.5% average accuracy improvement across five real-world benchmark datasets. Our method significantly enhances generalization to unseen generative models and, for the first time, systematically identifies and resolves task-model misalignment as a fundamental bottleneck to generalization in AIGI detection.

📝 Abstract
Vision Language Models (VLMs) are increasingly adopted for AI-generated image (AIGI) detection, yet converting VLMs into detectors requires substantial resources, while the resulting models still exhibit severe hallucinations. To probe the core issue, we conduct an empirical analysis and observe two characteristic behaviors: (i) fine-tuning VLMs on high-level semantic supervision strengthens semantic discrimination and generalizes well to unseen data; (ii) fine-tuning VLMs on low-level pixel-artifact supervision yields poor transfer. We attribute VLMs' underperformance to task-model misalignment: semantics-oriented VLMs inherently lack sensitivity to fine-grained pixel artifacts, and semantically non-discriminative pixel artifacts thus exceed their inductive biases. In contrast, we observe that conventional pixel-artifact detectors capture low-level pixel artifacts yet exhibit limited semantic awareness relative to VLMs, highlighting that distinct models are better matched to distinct tasks. In this paper, we formalize AIGI detection as two complementary tasks--semantic consistency checking and pixel-artifact detection--and show that neglecting either induces systematic blind spots. Guided by this view, we introduce the Task-Model Alignment principle and instantiate it as a two-branch detector, AlignGemini, comprising a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision. By enforcing orthogonal supervision on two simplified datasets, each branch trains to its strengths, producing complementary discrimination over semantic and pixel cues. On five in-the-wild benchmarks, AlignGemini delivers a +9.5 gain in average accuracy, supporting task-model alignment as an effective path to generalizable AIGI detection.
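The two-branch design described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the branch stubs and the max-score fusion rule are assumptions standing in for the fine-tuned VLM and the pixel-artifact expert.

```python
# Hypothetical sketch of a two-branch AIGI detector in the spirit of
# AlignGemini. Both branch functions are placeholders for real models.

def semantic_branch(image):
    """Stand-in for a VLM fine-tuned with pure semantic supervision.
    Returns a probability that the image is semantically inconsistent."""
    # Placeholder: a real system would query the fine-tuned VLM here.
    return image.get("semantic_inconsistency", 0.0)

def pixel_branch(image):
    """Stand-in for a pixel-artifact expert trained with pure
    pixel-level supervision. Returns an artifact probability."""
    return image.get("pixel_artifact_score", 0.0)

def detect(image, threshold=0.5):
    """Flag an image as AI-generated if either branch fires.
    Taking the max of the two scores reflects the paper's claim that
    neglecting either cue induces systematic blind spots."""
    return max(semantic_branch(image), pixel_branch(image)) >= threshold

# An image with strong pixel artifacts but plausible semantics is
# still caught by the pixel branch:
print(detect({"semantic_inconsistency": 0.1,
              "pixel_artifact_score": 0.9}))  # prints True
```

The fusion rule here is a deliberate simplification; the point is that the two branches score independent cues, so a miss by one branch does not suppress a hit by the other.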
Problem

Research questions and friction points this paper is trying to address.

Detect AI-generated images by aligning models with specific detection tasks
Address hallucinations in Vision Language Models for image detection
Combine semantic and pixel-artifact analysis for improved generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-branch detector with Vision Language Model
Separate semantic and pixel-artifact supervision training
Task-Model Alignment principle for complementary detection
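The decoupled training idea listed above can be illustrated with a toy setup. This is an assumed sketch, not the paper's code: each branch is reduced to a one-feature logistic classifier trained only on its own label stream, so semantic and pixel supervision never mix.

```python
import math

# Toy illustration of orthogonal, decoupled supervision (illustrative
# assumption, not the paper's training code): two independent one-feature
# logistic classifiers, each updated only by its own supervision signal.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_branch(pairs, lr=0.5, epochs=200):
    """pairs: (cue_strength, label) examples for one branch only."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in pairs:
            p = sigmoid(w * x + b)   # this branch's prediction
            w -= lr * (p - y) * x    # gradient step on this branch
            b -= lr * (p - y)        # the other branch is never touched
    return w, b

# Pure semantic supervision: label 1 = semantic inconsistency present.
semantic_pairs = [(1.0, 1), (0.9, 1), (0.1, 0), (0.0, 0)]
# Pure pixel supervision: label 1 = generator artifact present.
pixel_pairs = [(0.8, 1), (1.0, 1), (0.2, 0), (0.1, 0)]

ws, bs = train_branch(semantic_pairs)
wp, bp = train_branch(pixel_pairs)
```

Because each branch sees only its own supervision, each specializes to its own cue; at inference their scores can then be combined into a single decision.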
Ruoxin Chen
Tencent Youtu Lab
Jiahui Gao
The University of Hong Kong
Synthetic Data Generation · Multimodal Model · NLP
Kaiqing Lin
Shenzhen University
Multimedia Forensics · Multimedia Security · Steganalysis
Keyue Zhang
Tencent Youtu Lab
Yandan Zhao
Tencent Youtu Lab
Isabel Guan
Hong Kong University of Science and Technology
Taiping Yao
Tencent
Face Anti-Spoofing · Deepfake · Adversarial Attack
Shouhong Ding
Tencent Youtu Lab