IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

📅 2025-06-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing AIGC detection methods are predominantly opaque black-box binary classifiers that lack unified image-video modeling, which limits both transparency and practical applicability. To address this, we propose IVY-XDETECTOR, the first vision-language model enabling end-to-end, cross-modal (image/video) explainable detection. We further introduce IVY-FAKE, a large-scale explainable benchmark comprising over 150,000 samples annotated with natural-language reasoning for attribution. Our contributions are threefold: (1) a unified, explainable detection framework for both images and videos; (2) the first multimodal AIGC benchmark featuring fine-grained linguistic attribution; and (3) a novel architecture integrating cross-modal alignment, explainable attention mechanisms, and diffusion-artifact modeling to jointly produce detection outputs and human-readable attributions. IVY-XDETECTOR achieves state-of-the-art performance on both image and video detection benchmarks. The IVY-FAKE dataset and model are publicly released on Hugging Face.

๐Ÿ“ Abstract
The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose the Ivy Explainable Detector (IVY-XDETECTOR), a unified architecture that jointly performs detection and explanation for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.
Problem

Research questions and friction points this paper is trying to address.

Detects synthetic images and videos within a single unified framework
Addresses the lack of interpretability in existing AIGC detection methods
Provides a large-scale annotated dataset for explainable multimodal detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for image and video detection
Explainable vision-language model architecture
Large-scale dataset with rich natural-language annotations
Wayne Zhang
π3AI Lab
Changjiang Jiang
Wuhan University
Zhonghao Zhang
π3AI Lab
Chenyang Si
Nanjing University
Fengchang Yu
Wuhan University
Wei Peng
Stanford University