🤖 AI Summary
This work identifies a pervasive "Easy-Options Bias" (EOB) in multiple-choice visual question answering (VQA) benchmarks (e.g., MMStar, NExT-QA, SEED-Bench): models achieve high accuracy using only the visual input and the answer options, without processing the question, because correct options align significantly more strongly with the image in feature space than the distractors do. To address this, the authors propose GroundAttack, an automated toolkit that synthesizes hard negative options as visually plausible as the correct answer, and use it to re-annotate existing benchmarks. On the resulting EOB-free annotations, state-of-the-art vision-language models (VLMs) fall to chance-level accuracy when given only image + options, and show substantial, non-saturated performance even with the full input, exposing a persistent reliance on spurious shortcuts. This early study offers a systematic diagnosis and mitigation of option bias in VQA, moving evaluation toward genuine question-vision joint reasoning.
📝 Abstract
In this early study, we observe an Easy-Options Bias (EOB) issue in several multiple-choice Visual Question Answering (VQA) benchmarks, including MMStar, RealWorldQA, SEED-Bench, NExT-QA, STAR, and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual content in feature space than the negative options do, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these annotations, current VLMs approach random accuracy under the (V+O) setting and drop to non-saturated accuracy under the (V+Q+O) setting, providing a more realistic evaluation of VLMs' QA ability. Code and new annotations will be released soon.