AI Summary
Existing evaluation methods struggle to disentangle visual knowledge memorization from reasoning capabilities in multimodal large language models and lack precise metrics for atomic-level visual facts. To address this gap, this work proposes WorldVQA, a benchmark that explicitly decouples visual knowledge memorization from reasoning. It constructs a dataset spanning head to long-tail visual entities through a hierarchical taxonomy and designs targeted question-answering tasks that specifically assess a model's ability to recognize, name, and extract atomic visual knowledge. WorldVQA provides a rigorous, quantifiable evaluation framework that measures both the breadth of a model's visual knowledge and its propensity for hallucination, thereby establishing a foundational tool for factuality assessment in multimodal large language models.
Abstract
We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
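Neither section specifies WorldVQA's actual scoring protocol, so the following is only a minimal sketch of how "knowledge breadth" and "propensity for hallucination" could be quantified over a stratified taxonomy of visual entities. The `EvalItem` record, the stratum labels, and the abstention field are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalItem:
    stratum: str      # hypothetical taxonomy bucket, e.g. "head", "torso", "long_tail"
    correct: bool     # model named the entity correctly
    abstained: bool   # model declined to answer (e.g. "I don't know")

def stratified_report(items: list[EvalItem]) -> dict[str, dict[str, float]]:
    """Per-stratum accuracy (knowledge breadth) and hallucination rate
    (wrong answers among attempted answers), one common factuality split."""
    buckets: dict[str, list[EvalItem]] = defaultdict(list)
    for it in items:
        buckets[it.stratum].append(it)

    report = {}
    for stratum, group in buckets.items():
        answered = [it for it in group if not it.abstained]
        wrong = [it for it in answered if not it.correct]
        report[stratum] = {
            "accuracy": sum(it.correct for it in group) / len(group),
            "hallucination_rate": len(wrong) / len(answered) if answered else 0.0,
            "abstention_rate": 1 - len(answered) / len(group),
        }
    return report

# Toy example: two head-class items and one long-tail item
print(stratified_report([
    EvalItem("head", correct=True, abstained=False),
    EvalItem("head", correct=False, abstained=False),   # confident but wrong -> hallucination
    EvalItem("long_tail", correct=False, abstained=True),
]))
```

Under this reading, a model with broad memorized knowledge scores high accuracy on long-tail strata, while a model that answers confidently but incorrectly on rare entities shows a high hallucination rate there.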