Measuring Epistemic Humility in Multimodal Large Language Models

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from pervasive hallucination, particularly in visual question answering (VQA), where they frequently generate answers inconsistent with the input image, posing significant risks for safety-critical applications. Existing benchmarks emphasize answer accuracy but neglect a crucial capability: "epistemic humility", the ability to abstain from answering when none of the provided options is correct. To address this gap, we introduce HumbleBench, the first benchmark explicitly designed to evaluate MLLMs' epistemic humility. It features a novel "none-correct" multiple-choice format, constructed by parsing panoptic scene graph annotations to extract entities and relations, followed by GPT-4-Turbo-assisted question generation and rigorous human curation. Comprehensive evaluation of state-of-the-art MLLMs reveals systematic deficiencies in abstention performance across object, relation, and attribute reasoning tasks, highlighting a critical direction for advancing trustworthy AI.

📝 Abstract
Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to reject incorrect answers
Assessing epistemic humility through false-option rejection
Measuring reliability in safety-critical multimodal applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

HumbleBench benchmark tests MLLM hallucination rejection
Uses scene graph annotations and GPT-4-Turbo generation
Includes a "None of the above" option for epistemic humility evaluation
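The "none-correct" evaluation protocol described above can be sketched in a few lines. The item schema and field names below are illustrative assumptions, not the released HumbleBench format: each question carries a letter-keyed option set whose last option is "None of the above", and abstention accuracy is measured on the subset of items where that option is the ground truth.

```python
# Minimal sketch of scoring a "none-correct" multiple-choice benchmark.
# The Item schema is a hypothetical illustration, not the actual dataset format.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    options: dict[str, str]  # letter -> option text; "E" is "None of the above"
    answer: str              # ground-truth letter; "E" means no option is valid


def score(items: list[Item], predictions: list[str]) -> dict[str, float]:
    """Overall accuracy, plus abstention accuracy on none-correct items only."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    # Items whose ground truth is "None of the above" test epistemic humility.
    nota = [(item, pred) for item, pred in zip(items, predictions)
            if item.answer == "E"]
    nota_correct = sum(pred == "E" for _, pred in nota)
    return {
        "accuracy": correct / len(items),
        "abstention_accuracy": nota_correct / len(nota) if nota else float("nan"),
    }
```

Reporting abstention accuracy separately, rather than folding it into overall accuracy, is what distinguishes this setup from a standard recognition benchmark: a model can score well on ordinary items while still failing every none-correct item.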
Bingkui Tong
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
Jiaer Xia
Hong Kong Baptist University
Multimodal LLM
Sifeng Shang
Hong Kong Baptist University, Hong Kong
Kaiyang Zhou
Assistant Professor, Hong Kong Baptist University
Machine Learning · Computer Vision · Artificial Intelligence