MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large Vision-Language Models (LVLMs) frequently generate hallucinated text inconsistent with visual content due to insufficient region-level image understanding. To address this, we propose a training-free multi-region fusion decoding method. Our approach introduces, for the first time, a cross-region self-consistency mechanism: it localizes salient image regions via cross-attention, quantifies response consistency across regions using Jensen–Shannon divergence, and performs reliability-weighted fusion guided by chain-of-thought–inspired prompting. Crucially, no model fine-tuning or additional training is required—only inference-time region-aware prompting and consistency modeling are employed to enhance factual accuracy. Extensive experiments across diverse LVLM architectures and mainstream benchmarks demonstrate that our method significantly reduces hallucination rates while improving both visual alignment and factual reliability of generated text.

📝 Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations, that is, text inconsistent with the visual input, because of their limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates an initial response for each region, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
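The JSD-based weighting and fusion steps described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: each region's response is represented as a probability distribution over a shared toy vocabulary, a region's reliability is taken as a softmax over its negative mean JSD to the other regions, and fusion is a weighted average of the per-region distributions. The function names and the `temperature` parameter are hypothetical and not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def reliability_weights(dists, temperature=1.0):
    """Assumed weighting rule (not from the paper): softmax over the
    negative mean JSD of each region to all other regions, so regions
    that agree with the rest receive larger weights."""
    n = len(dists)
    if n == 1:
        return [1.0]
    mean_jsd = [
        sum(js_divergence(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [-d / temperature for d in mean_jsd]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(dists, weights):
    """Assumed fusion rule: reliability-weighted average of distributions."""
    vocab_size = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(vocab_size)]


# Toy example: three regions, three candidate tokens.
# Regions 0 and 1 roughly agree; region 2 is an outlier.
region_dists = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
weights = reliability_weights(region_dists)
fused = fuse(region_dists, weights)
```

In this toy setup the outlier region receives the smallest weight, so its preferred token is down-weighted in the fused distribution. Lowering the hypothetical `temperature` would sharpen that effect; how MRFD actually maps JSD to weights and applies them during decoding should be taken from the paper itself.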

Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in LVLMs via multi-region fusion

Improves factual grounding without model retraining

Uses cross-attention and JSD for consistency-aware decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Region Fusion Decoding for hallucination mitigation

Jensen-Shannon Divergence quantifies the reliability of each region's response

Training-free method improves factual grounding and consistency

🔎 Similar Papers

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models