WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

📅 2026-01-28

🤖 AI Summary
Existing evaluation methods struggle to disentangle visual knowledge memorization from reasoning capabilities in multimodal large language models and lack precise metrics for atomic-level visual facts. To address this gap, this work proposes WorldVQA, a novel benchmark that explicitly decouples visual knowledge memorization from reasoning. It constructs a dataset spanning head to long-tail visual entities through a hierarchical taxonomy and designs targeted question-answering tasks to specifically assess a model's ability to recognize, name, and extract atomic visual knowledge. WorldVQA provides a rigorous and quantifiable evaluation framework that effectively measures both the breadth of a model's visual knowledge and its propensity for hallucination, thereby establishing a foundational tool for factuality assessment in multimodal large language models.

📝 Abstract
We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
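
To make the idea of scoring entity naming across a stratified taxonomy concrete, here is a minimal sketch in Python. It is not the paper's scoring protocol, which this page does not describe. The record keys (`stratum`, `gold`, `prediction`), the normalized exact-match rule, and the use of the wrong-answer rate as a rough stand-in for hallucination are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(name.lower().split())


def score_by_stratum(records):
    """Aggregate naming results per stratum (for example "head" vs. "long_tail").

    Each record is a dict with hypothetical keys:
      "stratum"    - taxonomy bucket the entity belongs to
      "gold"       - reference entity name
      "prediction" - model answer, or None / empty string if the model abstained
    """
    counts = defaultdict(lambda: {"correct": 0, "wrong": 0, "abstained": 0})
    for r in records:
        pred = r["prediction"]
        if pred is None or not pred.strip():
            counts[r["stratum"]]["abstained"] += 1
        elif normalize(pred) == normalize(r["gold"]):
            counts[r["stratum"]]["correct"] += 1
        else:
            # A confident but incorrect name; treated here as a hallucination proxy.
            counts[r["stratum"]]["wrong"] += 1

    report = {}
    for stratum, c in counts.items():
        total = c["correct"] + c["wrong"] + c["abstained"]
        report[stratum] = {
            "accuracy": c["correct"] / total,
            "wrong_rate": c["wrong"] / total,
            "abstain_rate": c["abstained"] / total,
        }
    return report


# Toy records, invented for this example only.
example_records = [
    {"stratum": "head", "gold": "Eiffel Tower", "prediction": "eiffel  tower"},
    {"stratum": "head", "gold": "golden retriever", "prediction": "Labrador"},
    {"stratum": "long_tail", "gold": "kakapo", "prediction": None},
    {"stratum": "long_tail", "gold": "okapi", "prediction": "Okapi"},
]
example_report = score_by_stratum(example_records)
```

Reporting accuracy separately from the wrong-answer rate keeps a model that abstains on rare entities distinguishable from one that names them incorrectly, which a single pooled accuracy figure would hide.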
Problem

Research questions and friction points this paper is trying to address.

WorldVQA
Multimodal Large Language Models
visual world knowledge
knowledge evaluation
visual factuality
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldVQA
atomic visual knowledge
multimodal large language models
knowledge-reasoning decoupling
visual factuality