MM-R3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the inconsistency problem in multimodal large language models (MLLMs): semantically similar inputs can yield divergent responses, a gap that undermines robustness and trustworthiness yet has been largely overlooked in prior research. To address it, the authors introduce MM-R³, a benchmark designed explicitly for evaluating multimodal consistency across three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Crucially, they establish consistency as a primary evaluation dimension for MLLMs, distinct from accuracy. They further propose an adapter module trained to minimize inconsistency across prompts, yielding average absolute consistency gains of 5.7% on BLIP-2 and 12.5% on LLaVa 1.5M. Their analysis also shows that accuracy and consistency are decoupled: high accuracy does not imply high consistency. The work thus opens a new axis for comprehensive MLLM evaluation.

📝 Abstract
With the advent of Large Language Models (LLMs) and Multimodal (Visio-lingual) LLMs, a flurry of research has emerged, analyzing the performance of such models across a diverse array of tasks. While most studies focus on evaluating the capabilities of state-of-the-art (SoTA) MLLM models through task accuracy (e.g., Visual Question Answering, grounding) across various datasets, our work explores the related but complementary aspect of consistency: the ability of an MLLM model to produce semantically similar or identical responses to semantically similar queries. We note that consistency is a fundamental prerequisite (necessary but not sufficient condition) for robustness and trust in MLLMs. Humans, in particular, are known to be highly consistent (even if not always accurate) in their responses, and consistency is inherently expected from AI systems. Armed with this perspective, we propose the MM-R$^3$ benchmark, which analyses the performance in terms of consistency and accuracy in SoTA MLLMs with three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa. Furthermore, we propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts. With our proposed strategy, we achieve absolute improvements of 5.7% and 12.5% on average, in terms of consistency, on widely used MLLMs such as BLIP-2 and LLaVa 1.5M over their existing counterparts.
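The paper's exact consistency metric is not reproduced here, but the core idea (a model should answer semantically equivalent variants of a query the same way) can be illustrated with a simple majority-agreement score over paraphrase groups. This is a hypothetical sketch, not the authors' implementation; `consistency_score` and the example groups are illustrative names and data.

```python
from collections import Counter

def consistency_score(response_groups):
    """Average, over query groups, of the fraction of responses that
    agree with each group's majority answer. Each inner list holds a
    model's answers to semantically equivalent variants of one query."""
    scores = []
    for responses in response_groups:
        counts = Counter(r.strip().lower() for r in responses)
        majority_count = counts.most_common(1)[0][1]
        scores.append(majority_count / len(responses))
    return sum(scores) / len(scores)

# Two queries, each asked in three paraphrased forms:
groups = [
    ["a cat", "a cat", "a dog"],   # 2 of 3 responses agree
    ["blue", "blue", "blue"],      # all 3 responses agree
]
print(consistency_score(groups))   # (2/3 + 3/3) / 2 ≈ 0.833
```

Note that this score is independent of correctness: a model answering "a dog" three times would be perfectly consistent yet possibly wrong, which mirrors the paper's point that accuracy and consistency are distinct axes.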
Problem

Research questions and friction points this paper is trying to address.

Analyzing consistency of multimodal LLMs (MLLMs) across semantically equivalent inputs
Proposing the MM-R3 benchmark to evaluate MLLM consistency alongside accuracy
Mitigating inconsistency in MLLMs via a trained adapter module
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the MM-R3 benchmark for analyzing MLLM consistency
Introduces an adapter module trained to minimize inconsistency across prompts
Achieves average absolute consistency gains of 5.7% (BLIP-2) and 12.5% (LLaVa 1.5M)