🤖 AI Summary
Existing robustness evaluations for large vision-language models (LVLMs) focus predominantly on textual hallucinations, overlooking the critical challenge posed by misleading visual inputs. Method: We introduce MVI-Bench, the first comprehensive benchmark for evaluating LVLM robustness against visual misdirection. Built on a three-tier taxonomy spanning three semantic levels (Visual Concept, Visual Attribute, and Visual Relationship), it comprises 1,248 expert-annotated VQA instances across six categories. We also design MVI-Sensitivity, a novel fine-grained metric that quantifies model sensitivity to such misleading inputs. Contribution/Results: Extensive evaluation of 18 state-of-the-art LVLMs reveals pervasive and severe vulnerabilities, especially at the attribute and relationship levels. This work provides the first systematic characterization of the fragility of LVLMs' visual understanding, establishing a foundational benchmark, a principled evaluation methodology, and actionable pathways for robustness-aware modeling and assessment.
📝 Abstract
Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, largely overlooking the equally critical challenge that misleading visual inputs pose to visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specifically designed to evaluate how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To enable fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase are available at https://github.com/chenyil6/MVI-Bench.