Mitigating Easy Option Bias in Multiple-Choice Question Answering

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a pervasive “Easy Option Bias” (EOB) in multiple-choice visual question answering (VQA) benchmarks (e.g., MMStar, NExT-QA, SEED-Bench): models achieve high accuracy using only the visual input and answer options, without processing the question, because correct options exhibit significantly stronger visual feature alignment with the image than distractors. To address this, the authors propose GroundAttack, an automated framework that uses feature-space analysis to synthesize visually plausible yet challenging negative options highly similar to the image, and use it to re-annotate existing evaluation datasets. On the resulting EOB-mitigated benchmarks, state-of-the-art vision-language models (VLMs) fall to chance-level accuracy when given only image + options, and show substantial performance degradation even with the full input, demonstrating a persistent reliance on this shortcut. This is an early systematic diagnosis and mitigation of option bias in VQA, moving evaluation toward genuine question-vision joint reasoning.

📝 Abstract
In this early study, we observe an Easy-Options Bias (EOB) issue in several multiple-choice Visual Question Answering (VQA) benchmarks, including MMStar, RealWorldQA, SEED-Bench, NExT-QA, STAR, and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: in feature space, the correct answer typically aligns more closely with the visual content than the negative options do, creating a shortcut that lets VLMs infer the answer through simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these annotations, current VLMs fall to near-random accuracy under the (V+O) setting and drop to non-saturated accuracy under the (V+Q+O) setting, providing a more realistic evaluation of VLMs' QA ability. Code and new annotations will be released soon.
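The vision-option shortcut described in the abstract can be sketched as follows. This is a minimal illustration with toy numpy vectors, not the paper's actual features or models: a "model" that never reads the question can still pick the correct option purely by cosine similarity between the image embedding and each option embedding, because the correct option is the most image-aligned.

```python
# Toy sketch of the Easy-Options Bias shortcut: answer selection from
# vision (V) and options (O) alone, with no question (Q).
# Embeddings below are hand-made toy vectors, not real CLIP/VLM features.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_without_question(image_emb, option_embs):
    """Return the index of the option most similar to the image."""
    scores = [cosine(image_emb, o) for o in option_embs]
    return int(np.argmax(scores)), scores

# The "correct" option (index 2) is deliberately the most image-aligned,
# mimicking the visual-relevance imbalance the paper measures.
image_emb = np.array([1.0, 0.2, 0.1])
options = [
    np.array([0.1, 1.0, 0.0]),  # distractor, weakly aligned
    np.array([0.0, 0.5, 1.0]),  # distractor, weakly aligned
    np.array([0.9, 0.3, 0.1]),  # correct answer, strongly aligned
    np.array([0.2, 0.1, 1.0]),  # distractor, weakly aligned
]
pred, scores = answer_without_question(image_emb, options)
```

With embeddings distributed like this, similarity matching alone recovers the correct answer, which is exactly the shortcut an EOB-free benchmark must close off.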
Problem

Research questions and friction points this paper is trying to address.

Identifies that vision-language models exploit a vision-option similarity bias
Addresses shortcut learning in multiple-choice visual question answering
Generates hard negative options to eliminate the visual-relevance imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates hard negative options automatically
Creates visually plausible distractors for evaluation
Produces EOB-free annotations for realistic assessment
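The hard-negative idea above can be sketched in a few lines. This is a hedged illustration of the general principle, not GroundAttack's actual pipeline: from a pool of candidate options, keep those whose image similarity is closest to the correct answer's, so that vision-option matching alone can no longer separate answer from distractors. All names and vectors are hypothetical.

```python
# Sketch of hard-negative mining by visual-relevance matching
# (illustrative only; not the authors' GroundAttack implementation).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_hard_negatives(image_emb, answer_emb, candidate_embs, k=3):
    """Keep the k candidates whose image similarity is closest to the answer's."""
    target = cosine(image_emb, answer_emb)
    gaps = [abs(cosine(image_emb, c) - target) for c in candidate_embs]
    order = np.argsort(gaps)  # smallest similarity gap = hardest negative
    return [int(i) for i in order[:k]]

image_emb = np.array([1.0, 0.0, 0.0])
answer_emb = np.array([0.9, 0.1, 0.0])   # strongly image-aligned correct answer
candidates = [
    np.array([0.95, 0.05, 0.0]),  # nearly as image-aligned -> hard negative
    np.array([0.0, 1.0, 0.0]),    # unrelated to the image -> easy negative
    np.array([0.8, 0.3, 0.0]),    # fairly image-aligned -> hard negative
    np.array([0.1, 0.9, 0.0]),    # weakly aligned -> easy negative
]
hard = mine_hard_negatives(image_emb, answer_emb, candidates, k=2)
```

Distractors chosen this way match the correct answer's visual relevance, which is the property the paper's EOB-free annotations are built to enforce.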
Hao Zhang
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore
Chen Li
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore
Basura Fernando
Scientist at A*STAR Singapore, Assistant Professor at NTU
Visual Reasoning · Action Prediction · Action Recognition · Transfer Learning · Embodied AI