Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LVLM evaluation benchmarks suffer from data leakage, limited image style diversity, and insufficient coverage of noisy and adversarial scenarios, which hinders comprehensive assessment of cross-style generalization and robust perception. To address these limitations, we propose Dysca, the first synthetic-image-based, dynamic, and extensible benchmark for LVLM evaluation. Dysca follows a generative construction paradigm: Stable Diffusion synthesizes the images, and a rule-based method generates the question-answer pairs. It supports 51 image styles, 20 subtasks, and four evaluation scenarios (clean, corruption, print attack, and adversarial attack), with multiple-choice, true/false, and free-form question formats. Evaluation of 24 open-source and 2 closed-source LVLMs reveals significant deficiencies in cross-style comprehension and noise robustness. Dysca is publicly released as a scalable, leakage-resistant, and highly diverse benchmark for evaluating LVLM perceptual capabilities.

πŸ“ Abstract
Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most of them construct questions by selecting images from existing datasets, which risks data leakage. In addition, these benchmarks focus on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose Dysca, a dynamic and scalable benchmark that evaluates LVLMs using synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions, and the corresponding answers. We consider 51 image styles and evaluate perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking, and Adversarial Attacking) and with 3 question types (i.e., Multi-choice, True-or-false, and Free-form). Thanks to the generative paradigm, Dysca is scalable: new subtasks and scenarios can be added easily. A total of 24 advanced open-source LVLMs and 2 closed-source LVLMs are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Robin-WZQ/Dysca.
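
To make the rule-based idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each sample starts from sampled metadata (a style and a subject), that the text-to-image prompt is assembled from that metadata, and that the question and its ground-truth answer are derived from the same metadata, so no manual annotation is needed. The vocabularies `STYLES` and `ANIMALS`, the `Sample` class, and the `make_sample` function are hypothetical names chosen for illustration; the image synthesis step itself (for example with Stable Diffusion) is left out.

```python
import random
from dataclasses import dataclass, field

# Hypothetical vocabularies for illustration only; the real benchmark
# covers 51 styles and 20 subtasks with its own metadata.
STYLES = ["oil painting", "pixel art", "watercolor"]
ANIMALS = ["cat", "dog", "horse", "rabbit"]


@dataclass
class Sample:
    prompt: str          # text prompt that would be sent to the image generator
    question: str        # question shown to the LVLM
    answer: str          # ground truth derived from the sampled metadata
    question_type: str
    options: list = field(default_factory=list)


def make_sample(rng: random.Random, question_type: str) -> Sample:
    """Sample metadata, then derive prompt, question, and answer from it."""
    style = rng.choice(STYLES)
    animal = rng.choice(ANIMALS)
    prompt = f"a {animal}, {style} style"

    if question_type == "multi-choice":
        # Distractors are the other labels; the correct label is mixed in.
        options = [a for a in ANIMALS if a != animal] + [animal]
        rng.shuffle(options)
        return Sample(prompt, "Which animal is shown in the image?",
                      animal, question_type, options)

    if question_type == "true-or-false":
        # The statement is true only if the claimed label matches the metadata.
        claimed = rng.choice(ANIMALS)
        answer = "true" if claimed == animal else "false"
        return Sample(prompt, f"The image shows a {claimed}. True or false?",
                      answer, question_type, ["true", "false"])

    if question_type == "free-form":
        return Sample(prompt, "What animal is shown in the image?",
                      animal, question_type)

    raise ValueError(f"unknown question type: {question_type}")


if __name__ == "__main__":
    rng = random.Random(0)
    for qt in ("multi-choice", "true-or-false", "free-form"):
        print(make_sample(rng, qt))
```

Because every answer is computed from the metadata that produced the prompt, new samples can be drawn on demand, which is what makes this style of benchmark dynamic and resistant to leakage from existing datasets.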
Problem

Research questions and friction points this paper is trying to address.

LVLM Evaluation
Image Style Variability
Complex Scene Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dysca
Stable Diffusion
Visual Understanding Evaluation
Authors

Jie Zhang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of AI Safety, Chinese Academy of Sciences, Beijing 100190, China
Zhongqi Wang
Institute of Computing Technology, Chinese Academy of Sciences
Model Robustness
Mengqi Lei
PhD student, Tsinghua University
Hypergraph, Computer Vision, Vision Language Model
Zheng Yuan
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of AI Safety, Chinese Academy of Sciences, Beijing 100190, China
Bei Yan
Northeastern University
Signal Processing
Shiguang Shan
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of AI Safety, Chinese Academy of Sciences, Beijing 100190, China