Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses object hallucination in multimodal large language models (MLLMs), the tendency to describe objects absent from the input image, often caused by insufficient visual-linguistic alignment. The authors propose a prompt-agnostic, model-agnostic, plug-and-play method that leverages the object-centric attention of a self-supervised Vision Transformer to construct an auxiliary view. This auxiliary view identifies and masks the most salient yet unsupported visual evidence, thereby strengthening the contrastive signal in Visual Contrastive Decoding (VCD). Requiring only a single, cacheable forward pass, the approach consistently and significantly improves performance on two mainstream object hallucination benchmarks across two distinct MLLMs.

📝 Abstract
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. Leveraging object-centric attention in self-supervised Vision Transformers, we remove the most salient visual evidence to build an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
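The two ingredients described above can be sketched as follows. This is an illustrative toy sketch, not the paper's implementation: the CLS-attention values, mask ratio, logit arrays, and `alpha` are all placeholder assumptions; the VCD combination shown is the standard `(1 + α)·logits_original − α·logits_auxiliary` form from prior VCD work.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches = 16  # e.g. a 4x4 grid of ViT patches (toy size)
vocab_size = 8    # toy vocabulary

# Object-centric saliency: attention of the ViT [CLS] token over image
# patches. Here a random stand-in; the paper uses a self-supervised ViT.
cls_attention = rng.random(num_patches)

def mask_most_salient(attention, mask_ratio=0.25):
    """Build the auxiliary view's patch mask by dropping the
    top-`mask_ratio` most-attended (most salient) patches."""
    k = max(1, int(mask_ratio * attention.size))
    top = np.argsort(attention)[-k:]          # indices of most salient patches
    keep = np.ones(attention.size, dtype=bool)
    keep[top] = False                         # False = patch masked out
    return keep

def vcd_logits(logits_orig, logits_aux, alpha=1.0):
    """Contrastive decoding: amplify what the original view supports
    relative to what the degraded auxiliary view still predicts."""
    return (1 + alpha) * logits_orig - alpha * logits_aux

keep = mask_most_salient(cls_attention)       # cacheable: one ViT pass
# Toy next-token logits from the MLLM on each view (stand-ins).
logits_orig = rng.random(vocab_size)
logits_aux = rng.random(vocab_size)
contrasted = vcd_logits(logits_orig, logits_aux)
```

The masking step depends only on the image, so in practice the auxiliary view can be computed once and cached across prompts, which is what keeps the overhead to a single extra forward pass.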
Problem

Research questions and friction points this paper is trying to address.

object hallucination
multimodal large language models
visual contrastive decoding
object-centric attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

object hallucination
visual contrastive decoding
object-aligned auxiliary view
multimodal large language models
self-supervised Vision Transformers