🤖 AI Summary
Existing large vision-language models (LVLMs) often fail to ground their answers in the image, leading to weak visual dependency and pervasive hallucinations; meanwhile, mainstream benchmarks do not quantitatively evaluate whether answers genuinely depend on the visual input. This work introduces FiVL, a method for constructing datasets that both train LVLMs for stronger visual grounding and evaluate how well they achieve it: (1) a novel training task built on an augmented dataset that enforces strong visual dependency; (2) benchmarks that test image necessity via counterfactual checks, probing whether the model uses the image as substantive evidence rather than relying on linguistic priors; and (3) an attribution analysis that identifies the attention heads with the strongest vision-language alignment. Experiments show accuracy gains over the baseline on image-dependent tasks, empirically verify reliance on visual evidence, and enable interpretable diagnosis of vision-driven hallucinations, advancing both robustness and explainability in LVLMs.
📝 Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring that these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise from a lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks do not specifically measure the degree to which an answer requires the visual input. This limitation makes it difficult to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed both to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. We demonstrate the value of our datasets through three approaches. First, we introduce a novel training task based on our augmented training dataset, resulting in better performance than the baseline. Second, we present benchmarks to assess a model's ability to use the image as substantive evidence, rather than relying solely on linguistic priors. Finally, we identify the attention heads with the strongest vision-language alignment, enabling explainability of vision-driven hallucinations. The code is available at https://github.com/IntelLabs/fivl.
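The counterfactual idea behind measuring image necessity can be sketched as follows. This is a minimal illustration, not the FiVL implementation: an answer is treated as image-dependent if the model's confidence in it drops sharply once the image is withheld. The `visual_dependency` function and its inputs are hypothetical stand-ins for real model scores.

```python
def visual_dependency(score_with_image: float, score_without_image: float) -> float:
    """Relative drop in answer confidence when the image is masked out.

    Values near 1.0 suggest the answer relies heavily on the image;
    values near 0.0 suggest it is recoverable from linguistic priors alone.
    """
    if score_with_image <= 0.0:
        return 0.0
    return max(0.0, (score_with_image - score_without_image) / score_with_image)


# Toy scores: confidence 0.9 with the image present, 0.3 with it masked.
score = visual_dependency(0.9, 0.3)
print(round(score, 3))  # → 0.667, a strongly image-dependent answer
```

In practice the two scores would come from running the same question through the LVLM with and without (or with a masked) image; a benchmark built this way can flag questions answerable from language priors alone.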
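Identifying strongly vision-aligned attention heads can likewise be sketched with a simple proxy, assuming access to per-head attention weights over the input sequence (as exposed by most transformer implementations): rank heads by the fraction of their attention mass that lands on image tokens. The head names, weights, and token positions below are illustrative, not taken from the paper.

```python
def image_attention_mass(head_attn, image_token_positions):
    """Fraction of a head's attention mass placed on image tokens."""
    total = sum(head_attn)
    if total == 0:
        return 0.0
    return sum(head_attn[i] for i in image_token_positions) / total


# Hypothetical per-head attention over a 3-token sequence; positions 1 and 2
# are image tokens, position 0 is a text token.
heads = {"L3.H5": [0.1, 0.6, 0.3], "L7.H2": [0.8, 0.1, 0.1]}
image_tokens = [1, 2]

ranked = sorted(heads, key=lambda h: image_attention_mass(heads[h], image_tokens),
                reverse=True)
print(ranked)  # → ['L3.H5', 'L7.H2']: L3.H5 attends most to the image
```

Heads that score high under such a proxy are candidates for the vision-language alignment analysis; conversely, hallucinated answers whose top heads ignore image tokens point to language-prior shortcuts.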