🤖 AI Summary
This work addresses the limited capability of Large Vision-Language Models (LVLMs) in detecting AI-generated images. To this end, we introduce DiffuSyn Bench—the first automated synthetic benchmark for AI-image detection, built upon diffusion models. We propose an end-to-end automatic construction paradigm integrating topic retrieval, narrative script generation, controllable error injection, and diffusion-based image synthesis, enabling highly controllable, reproducible, and diverse text–image pair generation. Through RAG-enhanced reasoning, narrative modeling, error embedding, and comparison experiments against human participants, we uncover a significant "rightward bias" in LVLMs: their discrimination accuracy is systematically lower than that of humans. Our two synthetic benchmarks surpass conventional human-annotated benchmarks in efficiency, diversity, and cross-model comparability. This work establishes a novel evaluation paradigm for LVLM robustness and releases an open-source resource to advance research in trustworthy multimodal AI.
📝 Abstract
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images, and introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants on a mixed dataset of AI-generated and human-created images. Results showed that LVLMs could distinguish between the image types to some extent, but they exhibited a rightward bias and performed significantly worse than humans. Building on these findings, we developed an automated, AI-driven benchmark construction process involving topic retrieval, narrative script generation, error embedding, and image generation, which creates a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
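The four-stage construction process (topic retrieval, script generation, error embedding, image generation) can be sketched as a simple pipeline. This is a minimal illustrative sketch only: all function names and toy data are hypothetical placeholders, standing in for the paper's actual components (RAG-based retrieval, LLM script generation, and diffusion-based image synthesis).

```python
import random

def retrieve_topic(seed: int) -> str:
    """Placeholder topic retrieval (the paper uses RAG-enhanced retrieval)."""
    topics = ["street market", "mountain hike", "kitchen scene"]  # toy data
    return topics[seed % len(topics)]

def generate_script(topic: str) -> str:
    """Placeholder narrative script generation (an LLM in the real pipeline)."""
    return f"A short narrative set in a {topic}."

def embed_error(script: str, rng: random.Random) -> tuple[str, str]:
    """Inject a controllable error; returns (flawed_script, error_type)."""
    error_type = rng.choice(["anatomy", "physics", "text"])  # illustrative taxonomy
    return f"{script} [injected {error_type} error]", error_type

def generate_image(script: str) -> str:
    """Placeholder for diffusion-based synthesis; returns a fake image ID."""
    return f"img_{abs(hash(script)) % 10_000}"

def build_pair(seed: int) -> dict:
    """Run the four stages end to end to produce one text-image pair."""
    rng = random.Random(seed)  # seeded for reproducibility
    topic = retrieve_topic(seed)
    script = generate_script(topic)
    flawed, err = embed_error(script, rng)
    return {"topic": topic, "script": flawed,
            "error": err, "image": generate_image(flawed)}

pair = build_pair(0)
```

Seeding each pair makes the construction reproducible, which is one of the properties the benchmark claims over human-annotated alternatives.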