Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

📅 2023-12-03
🏛️ arXiv.org
📈 Citations: 9 · Influential: 0
🤖 AI Summary
Existing instruction-tuned large vision-language models (IT-LVLMs) lack a standardized benchmark for evaluating fundamental visual capabilities and detecting cross-modal hallucination. Method: We introduce MERLIM, a dedicated multi-modal evaluation benchmark for IT-LVLMs comprising over 300K image-question pairs. MERLIM centers its evaluation on "hidden hallucination" (correct-looking answers produced with weak visual grounding, dominated by language biases, or failing fine-grained recognition) and integrates multi-modal data construction, hallucination annotation protocols, cross-task consistency assessment, and visual-language bias disentanglement analysis. Contribution/Results: The systematic analysis reveals that state-of-the-art IT-LVLMs frequently exhibit object hallucination, misclassify fine-grained concepts, and show strong biases towards the language query. Critically, their outputs rely significantly on LLM priors rather than faithful visual grounding, exposing fundamental limitations in vision-language alignment and robustness.
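To make the object-hallucination part of this paradigm concrete, here is a minimal sketch of the kind of check it implies. This is not MERLIM's published protocol: `query_model` is a hypothetical stand-in for an IT-LVLM inference call, and the prompt wording and answer parsing are illustrative assumptions.

```python
from typing import Callable

def hallucination_rate(
    query_model: Callable[[str, str], str],  # (image_path, prompt) -> answer text
    samples: list[tuple[str, set[str]]],     # (image_path, ground-truth object set)
) -> float:
    """Fraction of predicted objects that are absent from the ground truth."""
    hallucinated, predicted = 0, 0
    for image_path, gt_objects in samples:
        answer = query_model(image_path, "List every object visible in the image.")
        # Naive parsing: split on commas/newlines; a real protocol would match
        # predictions against a fixed vocabulary with synonym handling.
        preds = {tok.strip().lower()
                 for tok in answer.replace("\n", ",").split(",") if tok.strip()}
        predicted += len(preds)
        hallucinated += len(preds - {obj.lower() for obj in gt_objects})
    return hallucinated / max(predicted, 1)
```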
📝 Abstract
Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot visual tasks. These large architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs. Our results bring important insights on the performance of state-of-the-art IT-LVLMs including limitations at identifying fine-grained visual concepts, object hallucinations across tasks, and biases towards the language query. Our findings also suggest that these models have weak visual grounding, but manage to make adequate guesses from global visual patterns or language biases contained in the LLM component. We name this phenomenon of correct answers with no visual grounding as hidden hallucinations.
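As a rough illustration of the hidden-hallucination idea from the abstract, the sketch below asks the same polar question about an original image and an edited copy in which the target object has been removed (for instance by inpainting). A grounded model should flip its answer; a model that still says "yes" was likely answering from language priors or global context. `query_model`, the yes/no prompt, and the editing step are assumptions for illustration, not the paper's exact procedure.

```python
def answers_without_grounding(query_model, original_img: str,
                              edited_img: str, target_object: str) -> bool:
    """True if the model claims the object is present even after it was removed."""
    prompt = f"Is there a {target_object} in the image? Answer yes or no."
    before = query_model(original_img, prompt).strip().lower()
    after = query_model(edited_img, prompt).strip().lower()
    # The object is present in `original_img` and removed in `edited_img`,
    # so a visually grounded model should answer yes, then no.
    return before.startswith("yes") and after.startswith("yes")
```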
Problem

Research questions and friction points this paper addresses.

Evaluating IT-LVLM effectiveness on fundamental computer vision tasks
Detecting cross-modal hallucination events in multi-modal models
Assessing limitations in fine-grained visual concept identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

MERLIM, a scalable multi-modal benchmark of over 300K image-question pairs for evaluating IT-LVLMs
Detects cross-modal hallucination events, including hidden hallucinations (correct answers without visual grounding)
Assesses visual grounding and biases towards the language query (see the sketch below)
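One simple way to probe the language-bias point above, sketched under stated assumptions rather than taken from the paper, is to ask semantically equivalent paraphrases of the same question and measure how consistent the answers are: if the output tracks the prompt wording more than the image, consistency drops. `query_model` and the paraphrase set are hypothetical.

```python
from collections import Counter

def prompt_consistency(query_model, image_path: str, paraphrases: list[str]) -> float:
    """Share of paraphrases that yield the modal (most common) answer."""
    answers = [query_model(image_path, p).strip().lower() for p in paraphrases]
    return Counter(answers).most_common(1)[0][1] / len(answers)

# Illustrative paraphrase set for a single polar question.
paraphrases = [
    "Is there a dog in the image? Answer yes or no.",
    "Does the picture contain a dog? Answer yes or no.",
    "Answer yes or no: can you see a dog in this photo?",
]
```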