FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation benchmarks struggle to assess hallucination in multimodal large language models (MLLMs) at the level of fine-grained visual perception. To address this gap, this work introduces the first hallucination evaluation benchmark built on high-fidelity, fine-grained counter-commonsense images. By combining controllable image editing with systematic prompting strategies, including Chain-of-Thought reasoning, the proposed approach indirectly yet precisely measures a model's ability to perceive critical visual details. Empirical analysis shows that state-of-the-art MLLMs hallucinate substantially at the fine-grained level, and the study further investigates how different reasoning strategies shape hallucination patterns. These findings offer novel insights and a practical evaluation tool for improving MLLM robustness and perceptual fidelity.

📝 Abstract
Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited either by over-simplified tasks that saturate metrics or by insufficient diversity that fails to adequately assess the extent of hallucination in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in the detailed visual perception of MLLMs. Extensive experiments on FREAK reveal severe hallucination issues in SOTA models' detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate a model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
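The evaluation protocol the abstract describes (querying an MLLM about counter-commonsense edited details, with and without a CoT prompt, and measuring the error rate) can be sketched as follows. This is an illustrative sketch only, not the authors' released code: `query_mllm`, the prompt templates, the image paths, and the simulated model behavior are all hypothetical placeholders.

```python
# Hypothetical sketch of a FREAK-style evaluation loop: compare a direct
# prompt against a Chain-of-Thought prompt on counter-commonsense images.

DIRECT_PROMPT = "Does the image contain {detail}? Answer yes or no."
COT_PROMPT = (
    "Describe the relevant region of the image step by step, "
    "then answer: does the image contain {detail}? Answer yes or no."
)

def query_mllm(image_path, prompt):
    # Placeholder: a real implementation would call a vision-language model.
    # Here we simulate a model that falls back on commonsense ("no, such a
    # thing would not appear") unless prompted to reason first.
    return "yes" if "step by step" in prompt else "no"

def hallucination_rate(samples, prompt_template):
    """Fraction of counter-commonsense details the model fails to perceive.

    Each sample is (image_path, edited_detail, ground_truth_answer).
    """
    errors = 0
    for image_path, detail, truth in samples:
        answer = query_mllm(image_path, prompt_template.format(detail=detail))
        if answer.strip().lower() != truth:
            errors += 1
    return errors / len(samples)

# Hypothetical edited images whose ground truth contradicts commonsense.
samples = [
    ("img_001.png", "a clock with 13 hour marks", "yes"),
    ("img_002.png", "a six-fingered hand", "yes"),
]
direct = hallucination_rate(samples, DIRECT_PROMPT)
cot = hallucination_rate(samples, COT_PROMPT)
print(f"direct: {direct:.2f}, cot: {cot:.2f}")  # → direct: 1.00, cot: 0.00
```

Because the edits contradict commonsense, a model that answers from priors rather than from the pixels fails the direct query, which is how the benchmark isolates fine-grained perception from language-side knowledge.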
Problem

Research questions and friction points this paper is trying to address.

hallucination
multimodal large language models
evaluation benchmark
fine-grained perception
visual hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained hallucination
multimodal benchmark
counter-commonsense editing
visual perception evaluation
Chain-of-Thought prompting
Zhihan Yin
Wangxuan Institute of Computer Technology, Peking University
Jianxin Liang
Wangxuan Institute of Computer Technology, Peking University
Yueqian Wang
Peking University
Multimodal Pre-trained Models
Yifeng Yao
Wangxuan Institute of Computer Technology, Peking University
Huishuai Zhang
Peking University
Deep Learning, Optimization, Information Theory
Dongyan Zhao
Peking University
Natural Language Processing, Semantic Data Management, QA, Dialogue System