🤖 AI Summary
This work addresses the limited capability of Large Vision-Language Models (LVLMs) in detecting AI-generated images. To this end, we introduce DiffuSyn Bench—the first automated synthetic benchmark for AI-image detection, built upon diffusion models. We propose an end-to-end automatic construction paradigm integrating topic retrieval, narrative script generation, controllable error injection, and diffusion-based image synthesis, enabling highly controllable, reproducible, and diverse text–image pair generation. Through RAG-enhanced reasoning, narrative modeling, error embedding, and comparison experiments against human participants, we uncover a significant "rightward bias" in LVLMs: their discrimination accuracy is systematically lower than that of humans. Our two synthetic benchmarks surpass conventional human-annotated benchmarks in efficiency, diversity, and cross-model comparability. This work establishes a novel evaluation paradigm for LVLM robustness and releases an open-source resource to advance research in trustworthy multimodal AI.
📝 Abstract
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images, and introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants on a mixed dataset of AI-generated and human-created images. Results showed that LVLMs could distinguish between the image types to some extent, but they exhibited a rightward bias and performed significantly worse than humans. Building on these findings, we developed an automated, AI-driven benchmark construction process involving topic retrieval, narrative script generation, error embedding, and image generation, which creates a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
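The four-stage construction process (topic retrieval, script generation, error embedding, image generation) can be sketched as a simple pipeline. This is a minimal illustrative sketch only: all function names and toy data are hypothetical placeholders, standing in for the paper's actual components (RAG-based retrieval, LLM script generation, and diffusion-based image synthesis).

```python
import random

def retrieve_topic(seed: int) -> str:
    """Placeholder topic retrieval (the paper uses RAG-enhanced retrieval)."""
    topics = ["street market", "mountain hike", "kitchen scene"]  # toy data
    return topics[seed % len(topics)]

def generate_script(topic: str) -> str:
    """Placeholder narrative script generation (an LLM in the real pipeline)."""
    return f"A short narrative set in a {topic}."

def embed_error(script: str, rng: random.Random) -> tuple[str, str]:
    """Inject a controllable error; returns (flawed_script, error_type)."""
    error_type = rng.choice(["anatomy", "physics", "text"])  # illustrative taxonomy
    return f"{script} [injected {error_type} error]", error_type

def generate_image(script: str) -> str:
    """Placeholder for diffusion-based synthesis; returns a fake image ID."""
    return f"img_{abs(hash(script)) % 10_000}"

def build_pair(seed: int) -> dict:
    """Run the four stages end to end to produce one text-image pair."""
    rng = random.Random(seed)  # seeded for reproducibility
    topic = retrieve_topic(seed)
    script = generate_script(topic)
    flawed, err = embed_error(script, rng)
    return {"topic": topic, "script": flawed,
            "error": err, "image": generate_image(flawed)}

pair = build_pair(0)
```

Seeding each pair makes the construction reproducible, which is one of the properties the benchmark claims over human-annotated alternatives.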