SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

📅 2025-02-18
🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from insufficient factual consistency in visual question answering (VQA), yet no benchmark systematically evaluates their factual accuracy. Method: We introduce SimpleVQA, the first VQA benchmark explicitly designed to assess factuality. It covers nine tasks built around objective events or common knowledge, situated within nine topics, with static, high-quality, time-invariant reference answers. The work relies on rigorous quality control to keep answers concise and clear, and uses an LLM-as-a-judge scoring system to evaluate responses with minimal variance. Contribution/Results: Evaluating 18 MLLMs and 8 text-only LLMs, we identify and analyze error cases arising from both image comprehension and text generation. SimpleVQA is intended as a reproducible, easy-to-evaluate benchmark for factuality in multimodal generation.

📝 Abstract
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the ability of MLLMs to answer short natural-language questions factually. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
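As a rough illustration of how judge verdicts from such a scoring system might be aggregated per task, the sketch below computes accuracy and attempt rate from graded items. It is not the paper's implementation: the three grade labels, the `summarize` function, and the task names in the usage note are all hypothetical, since the abstract does not specify the judge's label set or output format.

```python
from collections import defaultdict

# Hypothetical grade labels (an assumption; the abstract does not list them).
GRADES = ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")


def summarize(records):
    """Aggregate judge grades into per-task statistics.

    records: iterable of (task, grade) pairs, where grade is one of GRADES.
    Returns {task: {"n": int, "accuracy": float, "attempted": float}}.
    """
    counts = defaultdict(lambda: dict.fromkeys(GRADES, 0))
    for task, grade in records:
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade!r}")
        counts[task][grade] += 1

    summary = {}
    for task, c in counts.items():
        n = sum(c.values())
        summary[task] = {
            "n": n,
            # Share of all items in the task that the judge marked correct.
            "accuracy": c["CORRECT"] / n,
            # Share of items where the model gave any answer at all.
            "attempted": (n - c["NOT_ATTEMPTED"]) / n,
        }
    return summary
```

For example, calling `summarize([("task_a", "CORRECT"), ("task_a", "INCORRECT")])` yields one entry for the placeholder task `task_a` with `n` equal to 2. Because the reference answers are short and static, a simple count like this is enough; no time-dependent regrading is needed.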
Problem

Research questions and friction points this paper is trying to address.

Evaluates factuality in multimodal models
Assesses accuracy across diverse scenarios
Benchmarks image and text comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark
Quality control processes
LLM-as-a-judge scoring
Authors
Xianfu Cheng, Beihang University
Wei Zhang, Beihang University
Shiwei Zhang, Baidu Inc., China
Jian Yang, Beihang University
Xiangyuan Guan, Beihang University
Xianjie Wu, Beihang University
Xiang Li, Beihang University
Ge Zhang, M-A-P
Jiaheng Liu, M-A-P
Yuying Mai, Beijing Jiaotong University
Yutao Zeng, Beihang University
Zhoufutu Wen, ByteDance SEED (LLM Evaluation)
Ke Jin, Professor at Beijing Institute of Technology (Radiation damage, Ion Beam Analysis, high entropy alloys, Nuclear Material)
Baorui Wang, Beihang University
Weixiao Zhou, Beihang University
Yunhong Lu, Yantai University
Tongliang Li, Beihang University
Wenhao Huang, M-A-P
Zhoujun Li, Beihang University (Artificial Intelligence, Natural Language Processing, Network Security)