🤖 AI Summary
Existing evaluations of multimodal large language models (MLLMs) in visual question answering (VQA), which rely on static datasets and accuracy metrics, struggle to comprehensively assess model robustness and generalization. This work proposes MetaRA, the first framework to introduce metamorphic testing into MLLM-VQA evaluation, which defines metamorphic relations to generate controlled image-question variants that systematically probe model vulnerabilities under diverse conditions. Requiring no ground-truth labels, MetaRA enables model-agnostic consistency verification and uncovers critical failure modes often missed by conventional benchmarks, such as sensitivity to linguistic perturbations, overreliance on visual cues, and flaws in multimodal reasoning. Experiments demonstrate that MetaRA effectively identifies these failure patterns across multiple state-of-the-art MLLMs, offering more fine-grained diagnostic insights than accuracy alone.
📝 Abstract
Visual Question Answering (VQA), a representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.
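To make the idea of a metamorphic relation concrete, the following is a minimal Python sketch of how a label-free consistency check of this kind might look. It is not MetaRA's implementation: the `MetamorphicRelation` class, the `violation_rate` helper, the polite-prefix rewording, and the toy model are all hypothetical names introduced here for illustration, and images are stood in for by plain string identifiers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A VQA model is treated as a black box: (image, question) -> answer.
# Images are plain string identifiers here, purely for illustration.
VQAModel = Callable[[str, str], str]


@dataclass
class MetamorphicRelation:
    """Hypothetical container for one MR (not MetaRA's actual API)."""
    name: str
    # Builds the follow-up (image, question) from the source input.
    transform: Callable[[str, str], Tuple[str, str]]
    # Expected relation between the source answer and the follow-up answer.
    holds: Callable[[str, str], bool]


def violation_rate(model: VQAModel,
                   mr: MetamorphicRelation,
                   inputs: List[Tuple[str, str]]) -> float:
    """Fraction of inputs on which the model breaks the relation.

    No ground-truth answers are needed: only the two model outputs
    are compared against each other.
    """
    if not inputs:
        return 0.0
    violations = 0
    for image, question in inputs:
        source_answer = model(image, question)
        follow_image, follow_question = mr.transform(image, question)
        follow_answer = model(follow_image, follow_question)
        if not mr.holds(source_answer, follow_answer):
            violations += 1
    return violations / len(inputs)


def add_polite_prefix(image: str, question: str) -> Tuple[str, str]:
    """Assumed linguistic perturbation: reword the question, keep the image."""
    return image, "Please tell me: " + question[0].lower() + question[1:]


def same_answer(a: str, b: str) -> bool:
    """Invariance relation: the answer should not change."""
    return a.strip().lower() == b.strip().lower()


PARAPHRASE_MR = MetamorphicRelation(
    name="polite_prefix_invariance",
    transform=add_polite_prefix,
    holds=same_answer,
)


def toy_model(image: str, question: str) -> str:
    """Toy stand-in for an MLLM that is brittle to question wording."""
    if question.startswith("What color"):
        return "red"
    return "unknown"


inputs = [
    ("img_001", "What color is the car?"),
    ("img_002", "How many dogs are there?"),
]
rate = violation_rate(toy_model, PARAPHRASE_MR, inputs)
print(f"{PARAPHRASE_MR.name}: violation rate = {rate:.2f}")
```

In this sketch the toy model answers the first question differently once it is reworded, so half of the inputs violate the relation. A real study would replace `toy_model` with calls to an actual MLLM and would define further relations, for example image-side transformations, each with its own expected relation between the two answers.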