Unexplored flaws in multiple-choice VQA evaluations

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a previously overlooked, semantically neutral prompt-formatting bias in multiple-choice visual question answering (VQA) benchmarks: minor syntactic variations (such as punctuation, line breaks, or option layout) that preserve semantic equivalence nonetheless significantly degrade the answer accuracy of multimodal large language models (MLLMs), independently of answer ordering and model confidence. Method: The authors categorize three classes of prompt variation factors and conduct large-scale experiments across seven state-of-the-art MLLMs and five VQA datasets, evaluating 48 distinct prompt variants via controlled ablation studies and confidence-consistency analysis. Contribution/Results: They show that existing debiasing methods fail to mitigate this bias. The findings expose a methodological flaw in current multimodal evaluation frameworks and underscore the need to redesign prompt robustness assessment for reliable MLLM evaluation.

📝 Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.
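To make the idea of "semantically neutral prompt format variations" concrete, here is a minimal sketch of how such variants can be generated by crossing formatting factors. The two factors shown (option label style and separator) and the example question are illustrative assumptions, not the paper's actual three-factor taxonomy or its 48 variants:

```python
# Sketch: generate semantically equivalent multiple-choice prompts that
# differ only in formatting. Factor choices here are hypothetical.
from itertools import product

QUESTION = "What animal is shown in the image?"
OPTIONS = ["cat", "dog", "horse", "rabbit"]

# Formatting factors (illustrative): how options are labeled, and how
# question/options are separated.
label_styles = [lambda i: f"{chr(65 + i)}.",    # "A. cat"
                lambda i: f"({chr(65 + i)})"]   # "(A) cat"
separators = ["\n", " "]

prompts = []
for label, sep in product(label_styles, separators):
    opts = sep.join(f"{label(i)} {opt}" for i, opt in enumerate(OPTIONS))
    prompts.append(f"{QUESTION}{sep}{opts}")

# Every prompt asks the same question with the same options; only the
# surface form differs, yet the paper reports such changes shift accuracy.
```

Under this sketch, 2 label styles × 2 separators yield 4 distinct prompt strings with identical semantics; the paper's study scales the same idea to 48 variants across three factor classes.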
Problem

Research questions and friction points this paper is trying to address.

Identifies unexplored prompt formatting biases in MLLM evaluations
Analyzes sensitivity of multiple-choice VQA to minor format changes
Shows existing mitigation strategies fail to address these new biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identified unexplored prompt formatting biases in VQA
Analyzed impact through large-scale multi-model dataset study
Demonstrated existing mitigation strategies fail for new biases
Fabio Rosenthal
Technical University of Munich
Sebastian Schmidt
Technical University of Munich
Thorsten Graf
Volkswagen AG
Thorsten Bagodonat
Volkswagen AG
Stephan Günnemann
Professor of Computer Science, Technical University of Munich
Machine Learning, Graphs, Graph Neural Networks, Robustness
Leo Schwinn
Technical University of Munich
Machine Learning, Deep Learning, Adversarial Attacks