AI Summary
Existing evaluation methods struggle to disentangle visual knowledge memorization from reasoning capabilities in multimodal large language models and lack precise metrics for atomic-level visual facts. To address this gap, this work proposes WorldVQA, a benchmark that explicitly decouples visual knowledge memorization from reasoning. It constructs a dataset spanning head to long-tail visual entities through a hierarchical taxonomy and designs targeted question-answering tasks that specifically assess a model's ability to recognize, name, and extract atomic visual knowledge. WorldVQA provides a rigorous, quantifiable evaluation framework that measures both the breadth of a model's visual knowledge and its propensity for hallucination, thereby establishing a foundational tool for factuality assessment in multimodal large language models.
Abstract
We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
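Neither section specifies WorldVQA's actual scoring protocol, so the following is only a minimal sketch of how "knowledge breadth" and "propensity for hallucination" could be quantified over a stratified taxonomy of visual entities. The `EvalItem` record, the stratum labels, and the abstention field are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalItem:
    stratum: str      # hypothetical taxonomy bucket, e.g. "head", "torso", "long_tail"
    correct: bool     # model named the entity correctly
    abstained: bool   # model declined to answer (e.g. "I don't know")

def stratified_report(items: list[EvalItem]) -> dict[str, dict[str, float]]:
    """Per-stratum accuracy (knowledge breadth) and hallucination rate
    (wrong answers among attempted answers), one common factuality split."""
    buckets: dict[str, list[EvalItem]] = defaultdict(list)
    for it in items:
        buckets[it.stratum].append(it)

    report = {}
    for stratum, group in buckets.items():
        answered = [it for it in group if not it.abstained]
        wrong = [it for it in answered if not it.correct]
        report[stratum] = {
            "accuracy": sum(it.correct for it in group) / len(group),
            "hallucination_rate": len(wrong) / len(answered) if answered else 0.0,
            "abstention_rate": 1 - len(answered) / len(group),
        }
    return report

# Toy example: two head-class items and one long-tail item
print(stratified_report([
    EvalItem("head", correct=True, abstained=False),
    EvalItem("head", correct=False, abstained=False),   # confident but wrong -> hallucination
    EvalItem("long_tail", correct=False, abstained=True),
]))
```

Under this reading, a model with broad memorized knowledge scores high accuracy on long-tail strata, while a model that answers confidently but incorrectly on rare entities shows a high hallucination rate there.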