MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

📅 2026-05-18
🤖 AI Summary
Existing evaluations of multimodal large language models (MLLMs) in visual question answering (VQA) rely on static datasets and accuracy metrics, and therefore struggle to comprehensively assess model robustness and generalization. This work proposes MetaRA, the first framework to introduce metamorphic testing into MLLM-VQA evaluation. It defines metamorphic relations to generate controlled image-question variants that systematically probe model vulnerabilities under diverse conditions. Requiring no ground-truth labels, MetaRA enables model-agnostic consistency verification and uncovers critical failure modes often missed by conventional benchmarks, such as sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and flaws in multimodal reasoning. Experiments demonstrate that MetaRA effectively identifies these failure patterns across multiple state-of-the-art MLLMs, offering more fine-grained diagnostic insights than accuracy alone.
📝 Abstract
Visual Question Answering (VQA), as a representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.
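To make the idea of a metamorphic relation concrete, the following is a minimal sketch of how a label-free consistency check for a VQA model might look. It is not MetaRA's implementation: the abstract does not list the actual MRs, so the two relations here (horizontal image flip and question upper-casing, both expected to leave the answer unchanged), the `MetamorphicRelation` class, the `violation_rate` helper, and the deliberately brittle `toy_model` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]        # toy stand-in for pixel data (assumption)
Sample = Tuple[Image, str]     # (image, question)
Model = Callable[[Image, str], str]


@dataclass
class MetamorphicRelation:
    """Hypothetical container: an input transformation plus the
    relation expected between the source and follow-up answers."""
    name: str
    transform: Callable[[Sample], Sample]
    holds: Callable[[str, str], bool]


def flip_horizontal(sample: Sample) -> Sample:
    # Mirror each row; an object count should not change.
    image, question = sample
    return [row[::-1] for row in image], question


def uppercase_question(sample: Sample) -> Sample:
    # Purely surface-level linguistic perturbation.
    image, question = sample
    return image, question.upper()


def same_answer(source: str, follow_up: str) -> bool:
    return source.strip().lower() == follow_up.strip().lower()


def violation_rate(model: Model, samples: List[Sample],
                   mr: MetamorphicRelation) -> float:
    """Fraction of samples whose answers break the relation.
    No ground-truth answers are needed, only the two model outputs."""
    violations = 0
    for sample in samples:
        source_answer = model(*sample)
        follow_up_answer = model(*mr.transform(sample))
        if not mr.holds(source_answer, follow_up_answer):
            violations += 1
    return violations / len(samples)


def toy_model(image: Image, question: str) -> str:
    # Deliberately brittle stand-in for an MLLM: it only recognises
    # the lowercase phrase "how many".
    if "how many" in question:
        return str(sum(1 for row in image for px in row if px))
    return "unknown"


FLIP_MR = MetamorphicRelation("horizontal flip", flip_horizontal, same_answer)
CASE_MR = MetamorphicRelation("uppercase question", uppercase_question, same_answer)

SAMPLES: List[Sample] = [
    ([[1, 0], [0, 1]], "how many objects are there?"),
    ([[0, 0], [1, 0]], "how many objects are there?"),
]

if __name__ == "__main__":
    for mr in (FLIP_MR, CASE_MR):
        print(mr.name, violation_rate(toy_model, SAMPLES, mr))
```

In this toy setup the flip relation is never violated, while the upper-casing relation is violated on every sample, which mirrors the kind of sensitivity to linguistic perturbations described above. In a real evaluation, `toy_model` would be replaced by a call to the MLLM under test and the transformations by the framework's own MRs.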
Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering
Multimodal Large Language Models
Robustness Evaluation
Metamorphic Testing
Model Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metamorphic Testing
Robustness Assessment
Multimodal Large Language Models
Visual Question Answering
Metamorphic Relations
Quanxing Xu
School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR 999078, China
Yuhao Tian
School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR 999078, China
Ling Zhou
School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR 999078, China
Xian Zhong
Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China; and State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan 430063, China
Xiaohua Huang
The University of Memphis
Cancer Nanomedicine
Rubing Huang
Macau University of Science and Technology
AI for Software Engineering, Software Engineering for AI, Software Testing, AI Applications
Chia-Wen Lin
Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan