Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

πŸ“… 2024-10-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ€– AI Summary
Existing LVLM evaluation benchmarks are static and prone to data contamination, so they fail to reflect models’ true capabilities. To address this, we propose VLB, a dynamic multimodal evaluation protocol. VLB uses a vision-language bootstrapping module to generate new VQA samples on the fly by jointly modifying images and questions, coupled with a cross-modal consistency discrimination module that checks each generated sample remains semantically faithful to the original. By composing controllable joint perturbations into dynamic strategy combinations, VLB produces complexity-adjustable, continually refreshed variants of existing benchmarks, establishing an evaluation framework that co-evolves with LVLM capabilities. On SEEDBench, MMBench, and MME, VLB significantly mitigates data contamination and more accurately exposes models’ performance limits and reasoning deficiencies.
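As a rough illustration of how such a protocol could be wired together, here is a minimal Python sketch. All names (`VQASample`, `Strategy`, `bootstrap`) are hypothetical; the paper does not publish this interface, and the `judge` callable stands in for VLB's cross-modal consistency discrimination module.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class VQASample:
    image: Any      # e.g. a PIL image or a pixel array
    question: str
    answer: str

# A bootstrapping strategy rewrites the image and/or the question.
Strategy = Callable[[VQASample], VQASample]

# The judge decides whether a perturbed sample is still semantically
# consistent with the original (VLB's cross-modal consistency check).
Judge = Callable[[VQASample, VQASample], bool]

def bootstrap(original: VQASample,
              strategies: List[Strategy],
              judge: Judge) -> VQASample:
    """Apply strategies in sequence, keeping only edits the judge accepts."""
    current = original
    for strategy in strategies:
        candidate = strategy(current)
        if judge(original, candidate):
            current = candidate   # accept the consistent perturbation
    return current
```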

πŸ“ Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to strong performance on various multimodal evaluation benchmarks. However, these benchmarks remain static and overlap with pre-training data, resulting in fixed complexity constraints and data contamination issues. This raises concerns about the validity of such evaluations. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment of LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that newly generated samples remain consistent with the originals. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.
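The abstract's "diverse complexities" suggests difficulty can be dialed up by composing more perturbations per sample. A hedged sketch of that idea, reusing the `Strategy` type from the sketch above (the `complexity` knob and the random-sampling scheme are assumptions for illustration, not the paper's exact recipe):

```python
import random
from typing import Callable, List

Strategy = Callable[["VQASample"], "VQASample"]  # as in the sketch above

def compose_strategies(pool: List[Strategy],
                       complexity: int,
                       seed: int = 0) -> List[Strategy]:
    """Pick `complexity` strategies at random; composing more
    perturbations yields a harder dynamic variant of the benchmark."""
    rng = random.Random(seed)
    return rng.sample(pool, k=min(complexity, len(pool)))
```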
Problem

Research questions and friction points this paper is trying to address.

Static benchmarks limit evaluation validity for LVLMs
Data contamination issues from overlapping pretraining datasets
Need dynamic complexity to match evolving LVLM capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic multimodal evaluation with flexible complexity
Vision-Language Bootstrapping for robust assessment
Multimodal bootstrapping module modifies images and language (see the end-to-end sketch below)
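Putting the pieces together, a dynamic evaluation run over a static benchmark might look like the following end-to-end sketch, building on `VQASample`, `bootstrap`, and `compose_strategies` from the sketches above. The dummy model, no-op strategies, and always-accepting judge are placeholders so the snippet runs as-is; real strategies would perturb pixels and rephrase questions.

```python
import random

def evaluate(model, benchmark, strategy_pool, judge, complexity=2, seed=0):
    """Score `model` on a dynamically bootstrapped variant of `benchmark`."""
    rng = random.Random(seed)
    correct = 0
    for sample in benchmark:
        chosen = compose_strategies(strategy_pool, complexity,
                                    seed=rng.randrange(10**6))
        dynamic = bootstrap(sample, chosen, judge)
        prediction = model(dynamic.image, dynamic.question)
        correct += int(prediction.strip().lower() == dynamic.answer.strip().lower())
    return correct / len(benchmark)

# Tiny smoke test: identity strategies and an always-accepting judge
# reduce the dynamic benchmark back to the original static one.
if __name__ == "__main__":
    bench = [VQASample(image=None, question="Is the sky blue?", answer="yes")]
    identity = lambda s: s
    always_ok = lambda orig, cand: True
    dummy_model = lambda img, q: "yes"
    print(evaluate(dummy_model, bench, [identity, identity], always_ok))
```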