🤖 AI Summary
This work addresses a limitation of existing visual question answering (VQA) benchmarks: they focus predominantly on image perception and do not adequately assess a model's ability to use external knowledge. The authors introduce WikiVQABench, a human-curated, knowledge-grounded VQA benchmark built by combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Large language models generate candidate multiple-choice question-answer sets, which human annotators then review and curate for factual correctness, visual-text consistency, and the requirement that each question needs external knowledge in addition to visual evidence. An evaluation of 15 vision-language models (256 million to 90 billion parameters) shows a wide accuracy range (24.7% to 75.6%), indicating that the benchmark discriminates between models on knowledge-intensive reasoning.
📝 Abstract
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world questions can only be answered correctly with external knowledge that is not directly observable in the image. We introduce WikiVQABench, a human-curated, knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are then reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). An evaluation of fifteen VLMs (256M to 90B parameters) reveals a wide performance range (24.7% to 75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.
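As an illustration of how the reported accuracy figures can be read, the following is a minimal sketch of multiple-choice scoring, not the authors' benchmarking code. The `MCQItem` record, its field names, and the `chance_accuracy` helper are hypothetical, and the number of answer options per question is an assumption, since the abstract does not state it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical record for one multiple-choice VQA instance."""
    image_id: str
    question: str
    options: Sequence[str]   # candidate answers shown to the model
    answer_index: int        # index of the correct option


def accuracy(items: Sequence[MCQItem], predictions: Sequence[int]) -> float:
    """Fraction of items where the predicted option index is correct."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


def chance_accuracy(items: Sequence[MCQItem]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        return 0.0
    return sum(1.0 / len(item.options) for item in items) / len(items)


# Placeholder data for illustration only; not drawn from the benchmark.
demo_items = [
    MCQItem(f"img_{i}", f"question {i}", ("A", "B", "C", "D"), answer_index=i % 4)
    for i in range(4)
]
demo_predictions = [0, 1, 0, 0]  # two of four predictions match the answer key

demo_accuracy = accuracy(demo_items, demo_predictions)
demo_chance = chance_accuracy(demo_items)
```

Comparing a model's accuracy against the random-guessing baseline for the same items helps separate knowledge-driven answers from guessing. Under the assumed four-option format, that baseline would be 25%.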