FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation benchmarks struggle to assess hallucination in multimodal large language models (MLLMs) at the level of fine-grained visual perception. To address this gap, this work introduces the first hallucination evaluation benchmark built on high-fidelity, fine-grained counter-commonsense images. By combining controllable image editing with systematic prompting strategies, including Chain-of-Thought reasoning, the proposed approach indirectly yet precisely measures a model's ability to perceive critical visual details. Empirical analysis shows that state-of-the-art MLLMs hallucinate substantially at the fine-grained level, and the study further investigates how different reasoning strategies shape hallucination patterns. These findings offer novel insights and a practical evaluation tool for improving MLLM robustness and perceptual fidelity.

📝 Abstract
Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited either by over-simplified tasks that saturate metrics or by insufficient diversity that fails to adequately assess the extent of hallucination in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in the detailed visual perception of MLLMs. Extensive experiments on FREAK reveal severe hallucination issues in SOTA models' detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate a model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
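The evaluation protocol the abstract describes (querying an MLLM about counter-commonsense edited details, with and without a CoT prompt, and measuring the error rate) can be sketched as follows. This is an illustrative sketch only, not the authors' released code: `query_mllm`, the prompt templates, the image paths, and the simulated model behavior are all hypothetical placeholders.

```python
# Hypothetical sketch of a FREAK-style evaluation loop: compare a direct
# prompt against a Chain-of-Thought prompt on counter-commonsense images.

DIRECT_PROMPT = "Does the image contain {detail}? Answer yes or no."
COT_PROMPT = (
    "Describe the relevant region of the image step by step, "
    "then answer: does the image contain {detail}? Answer yes or no."
)

def query_mllm(image_path, prompt):
    # Placeholder: a real implementation would call a vision-language model.
    # Here we simulate a model that falls back on commonsense ("no, such a
    # thing would not appear") unless prompted to reason first.
    return "yes" if "step by step" in prompt else "no"

def hallucination_rate(samples, prompt_template):
    """Fraction of counter-commonsense details the model fails to perceive.

    Each sample is (image_path, edited_detail, ground_truth_answer).
    """
    errors = 0
    for image_path, detail, truth in samples:
        answer = query_mllm(image_path, prompt_template.format(detail=detail))
        if answer.strip().lower() != truth:
            errors += 1
    return errors / len(samples)

# Hypothetical edited images whose ground truth contradicts commonsense.
samples = [
    ("img_001.png", "a clock with 13 hour marks", "yes"),
    ("img_002.png", "a six-fingered hand", "yes"),
]
direct = hallucination_rate(samples, DIRECT_PROMPT)
cot = hallucination_rate(samples, COT_PROMPT)
print(f"direct: {direct:.2f}, cot: {cot:.2f}")  # → direct: 1.00, cot: 0.00
```

Because the edits contradict commonsense, a model that answers from priors rather than from the pixels fails the direct query, which is how the benchmark isolates fine-grained perception from language-side knowledge.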
Problem

Research questions and friction points this paper is trying to address.

hallucination
multimodal large language models
evaluation benchmark
fine-grained perception
visual hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained hallucination
multimodal benchmark
counter-commonsense editing
visual perception evaluation
Chain-of-Thought prompting
Zhihan Yin
Wangxuan Institute of Computer Technology, Peking University
Jianxin Liang
Wangxuan Institute of Computer Technology, Peking University
Yueqian Wang
Peking University
Multimodal Pre-trained Models
Yifeng Yao
Wangxuan Institute of Computer Technology, Peking University
Huishuai Zhang
Peking University
Deep Learning, Optimization, Information Theory
Dongyan Zhao
Peking University
Natural Language Processing, Semantic Data Management, QA, Dialogue System