CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

📅 2025-12-29

🤖 AI Summary

Large Vision-Language Models (LVLMs) frequently generate hallucinated content inconsistent with input images, hindering real-world deployment. To address this, we propose CoFi-Dec—a training-free, model-agnostic decoding framework. It employs a coarse-to-fine visual-conditioned generative self-feedback mechanism to construct hierarchical visual hypotheses, and introduces a Wasserstein-distance-driven geometric consistency constraint to align decoding trajectories across multiple hypothesis scales. The method is plug-and-play, requiring no fine-tuning or additional parameters. Evaluated on six mainstream hallucination benchmarks, CoFi-Dec significantly reduces both entity-level and semantic-level hallucinations, consistently outperforming existing decoding strategies across diverse LVLMs. Its core innovations lie in the synergistic integration of generative self-feedback, multi-scale visual hypothesis modeling, and Wasserstein-based distributional alignment—enabling robust, vision-grounded text generation without architectural or training modifications.

📝 Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.
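The abstract does not spell out the fusion rule, so the following is only a toy illustration of what fusing next-token distributions with a Wasserstein distance could look like, not the authors' method. It assumes a hypothetical scheme in which each visual hypothesis is weighted by how close its distribution is to the one obtained from the original image. The names `wasserstein_1d`, `fuse_by_wasserstein`, and `temperature` are invented for this sketch, and treating token indices as an ordered 1D support is a simplification (a real vocabulary has no natural ordering, so a practical version would need a ground cost, for example from token embeddings).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on the same
    unit-spaced 1D support, computed as the L1 distance between CDFs.
    ASSUMPTION: token indices are treated as positions on a line."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def fuse_by_wasserstein(p_orig, hypotheses, temperature=1.0):
    """Hypothetical fusion rule: mix the hypothesis distributions, giving
    more weight to those closer (in W1) to the original-image distribution.
    Returns (fused_distribution, weights)."""
    hyps = np.asarray(hypotheses, dtype=float)
    dists = np.array([wasserstein_1d(p_orig, h) for h in hyps])
    weights = softmax(-dists / temperature)  # closer hypothesis -> larger weight
    fused = weights @ hyps                   # convex combination, still sums to 1
    return fused, weights


if __name__ == "__main__":
    p_orig = np.array([0.7, 0.2, 0.1])    # distribution given the original image
    p_coarse = np.array([0.6, 0.3, 0.1])  # given the coarse-level synthetic image
    p_fine = np.array([0.1, 0.2, 0.7])    # given the fine-level synthetic image
    fused, weights = fuse_by_wasserstein(p_orig, [p_coarse, p_fine])
    print("weights:", weights)
    print("fused:", fused)
```

In this toy setup the hypothesis that disagrees strongly with the original image is down-weighted rather than discarded, which is one simple way to read "aligning" predictive distributions; the released implementation may do something quite different.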

Problem

Research questions and friction points this paper is trying to address.

Reduces hallucinations in large vision-language models' outputs

Aligns text predictions with multi-level visual grounding cues

Improves reliability without requiring additional model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine generative feedback for hallucination reduction

Wasserstein-based fusion aligns multi-level visual predictions

Training-free, model-agnostic decoding framework for LVLMs
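To make the data flow behind these points concrete, here is a minimal skeleton of the coarse-to-fine feedback loop as the abstract describes it. Every callable (`make_coarse_view`, `make_fine_view`, `describe`, `text_to_image`, `next_token_probs`, `fuse`) is a hypothetical placeholder injected by the caller; the abstract does not say how the views are built or which models are used. Including the original image among the visual conditions is also an assumption of this sketch.

```python
from typing import Any, Callable, List, Sequence


def gather_visual_conditions(
    image: Any,
    make_coarse_view: Callable[[Any], Any],
    make_fine_view: Callable[[Any], Any],
    describe: Callable[[Any], str],
    text_to_image: Callable[[str], Any],
) -> List[Any]:
    """Build the visual conditions used during decoding.

    1. Derive a coarse and a fine view of the input image.
    2. Ask the LVLM for an intermediate textual response per view.
    3. Turn each response into a synthetic image (a visual hypothesis).
    ASSUMPTION: the original image is kept as the first condition.
    """
    views = [make_coarse_view(image), make_fine_view(image)]
    drafts = [describe(view) for view in views]
    synthetic_images = [text_to_image(draft) for draft in drafts]
    return [image, *synthetic_images]


def decode_step(
    conditions: Sequence[Any],
    prefix: Sequence[int],
    next_token_probs: Callable[[Any, Sequence[int]], Sequence[float]],
    fuse: Callable[[List[Sequence[float]]], Sequence[float]],
) -> Sequence[float]:
    """One decoding step: query the model once per visual condition,
    then merge the per-condition next-token distributions with `fuse`."""
    per_condition = [next_token_probs(cond, prefix) for cond in conditions]
    return fuse(per_condition)
```

Because nothing here touches model weights, the skeleton mirrors the training-free, model-agnostic framing: swapping the LVLM or the text-to-image model only changes which callables are passed in.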

🔎 Similar Papers

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models