FINER: MLLMs Hallucinate under Fine-grained Negative Queries

📅 2026-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation benchmarks focus primarily on coarse-grained visual questions, which makes them inadequate for assessing hallucination in multimodal large language models (MLLMs) under fine-grained negative queries. This work presents the first systematic investigation of this issue and introduces FINER, a fine-grained negative query framework, along with two new benchmarks, FINER-CompreCap and FINER-DOCCI, that cover multi-object, multi-attribute, multi-relation, and “what”-type questions. Building on this framework, the authors propose FINER-Tuning, a fine-tuning approach that applies Direct Preference Optimization (DPO) to FINER-inspired data. Applied to four frontier MLLMs, FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on the hallucination settings of the FINER benchmarks, and it also improves performance on eight existing hallucination benchmarks and six general-purpose multimodal benchmarks.

📝 Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
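To make the DPO component concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, not the authors' implementation. The example pair, the file name, the log-probability values, and `beta=0.1` are all hypothetical and chosen only for illustration; the abstract does not specify how FINER-Tuning constructs its data or sets its hyperparameters.

```python
import math

# Hypothetical preference pair for a fine-grained negative query:
# the dog and sofa are assumed present, the red collar is the mismatch.
example_pair = {
    "image": "dog_on_sofa.jpg",  # illustrative file name, not from the paper
    "query": "Is the brown dog on the sofa wearing a red collar?",
    "chosen": "No. There is a brown dog on the sofa, but it has no red collar.",
    "rejected": "Yes, the brown dog on the sofa is wearing a red collar.",
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Made-up log-probabilities: the policy favors the chosen answer more than
# the reference model does, so the loss falls below log(2).
loss = dpo_loss(-5.0, -12.0, -8.0, -8.0)
print(f"DPO loss: {loss:.4f}")
```

When the policy and reference models assign identical log-probabilities, the loss equals log 2; it shrinks as the policy raises the likelihood of the chosen (non-hallucinated) answer relative to the rejected one.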

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models

hallucination

fine-grained queries

negative queries

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained negative queries

multimodal hallucination

FINER-Tuning