WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

📅 2026-01-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing evaluation methods struggle to disentangle visual knowledge memorization from reasoning capabilities in multimodal large language models, and they lack precise metrics for atomic-level visual facts. To address this gap, this work proposes WorldVQA, a benchmark that explicitly decouples visual knowledge memorization from reasoning. It constructs a dataset spanning head to long-tail visual entities through a hierarchical taxonomy and designs targeted question-answering tasks to assess a model's ability to recognize, name, and extract atomic visual knowledge. WorldVQA provides a rigorous, quantifiable evaluation framework that measures both the breadth of a model's visual knowledge and its propensity for hallucination, establishing a foundational tool for factuality assessment in multimodal large language models.

πŸ“ Abstract
We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure"what the model memorizes."The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
Problem

Research questions and friction points this paper is trying to address.

WorldVQA · Multimodal Large Language Models · visual world knowledge · knowledge evaluation · visual factuality
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldVQA · atomic visual knowledge · multimodal large language models · knowledge-reasoning decoupling · visual factuality
Authors

Runjie Zhou (Moonshot AI)
Youbo Shao (Moonshot AI)
Haoyu Lu (Renmin University of China | Moonshot AI): multimodal foundation models, video-language modeling
Bowei Xing (Moonshot AI)
Tongtong Bai (Moonshot AI)
Yujie Chen (Moonshot AI)
Jie Zhao (Moonshot AI)
Lin Sui (Moonshot AI): computer vision
Haotian Yao (Moonshot AI)
Zijia Zhao (Institute of Automation, Chinese Academy of Sciences (CASIA)): multimodal learning
Hao Yang (Moonshot AI): AIGC
Haoning Wu (Shanghai Jiao Tong University): computer vision, multi-modal learning, generative models
Zaida Zhou (Moonshot AI)
Jinguo Zhu (Moonshot AI)
Zhiqi Huang (Moonshot AI): LLM
Yiping Bao (Moonshot AI)
Yangyang Liu (CASIA): OCR, deep learning
Y.Charles (Moonshot AI)
Xinyu Zhou (Moonshot AI)