🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from insufficient factual consistency in visual question answering (VQA), yet no benchmark systematically evaluates their factual accuracy. Method: We introduce SimpleVQA—the first VQA benchmark explicitly designed to assess factual consistency—covering nine tasks around objective events or common knowledge, situated within nine diverse topics, with static, high-quality, time-invariant reference answers. We formally define and quantify multimodal factual consistency for MLLMs; propose a high signal-to-noise LLM-as-a-judge automatic evaluation framework; and design a robust, cross-task, cross-scenario assessment protocol. Contribution/Results: Evaluating 18 MLLMs and 8 text-only LLMs, we identify characteristic factual error patterns arising from both image understanding and text generation. SimpleVQA establishes the first reproducible, scalable, and principled standard for evaluating factual consistency in trustworthy multimodal generation.
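For intuition, here is a minimal sketch of what a single benchmark item might look like given the description above. The field names and task/topic labels are illustrative assumptions, not the released dataset schema:

```python
from dataclasses import dataclass

# Hypothetical item layout inferred from the summary; the actual
# SimpleVQA release may use different field names and label sets.
@dataclass
class SimpleVQAItem:
    image_path: str        # image the question is grounded in
    question: str          # short natural-language question
    reference_answer: str  # static, time-invariant gold answer
    task: str              # one of the 9 objective-event / common-knowledge tasks
    topic: str             # one of the 9 topics

item = SimpleVQAItem(
    image_path="images/eiffel_tower.jpg",          # hypothetical path
    question="In which city is the landmark in the image located?",
    reference_answer="Paris",
    task="landmark identification",                # illustrative label
    topic="geography",                             # illustrative label
)
```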
📝 Abstract
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality of MLLMs in answering short natural-language questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
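As a rough illustration of the LLM-as-a-judge scoring mentioned above, the sketch below grades a model's prediction against a gold reference answer. The judge model, prompt wording, and three-way grading labels are assumptions for this sketch; the paper's actual protocol may differ:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical grading prompt; the benchmark's real prompt is not shown here.
JUDGE_PROMPT = """You are grading a model's answer to a visual question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""

def judge(question: str, reference: str, prediction: str,
          model: str = "gpt-4o") -> str:
    """Ask a judge LLM whether the prediction matches the gold answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading keeps scoring variance low
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# Example call:
# judge("In which city is the landmark located?", "Paris", "It looks like Paris.")
# -> "CORRECT" (ideally)
```

Because reference answers are kept short and unambiguous, a simple verdict prompt like this can score predictions with low variance across runs, which is what makes fully automatic evaluation practical.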