🤖 AI Summary
Large vision-language models (LVLMs) often produce text that is poorly grounded in the visual input, manifesting as hallucinated scene elements, omission of salient parts of the scene, and incorrect inferences about object attributes and relationships. To address this, the authors propose ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a framework that uses fine-grained reward modeling to substantially improve the visual grounding of LVLMs over pre-trained baselines. The approach relies on much cheaper human evaluations rather than full supervision, complemented by automated methods. Experiments across a variety of evaluation methods and benchmarks demonstrate the effectiveness of the approach. The authors also release their human annotation dataset of 15,440 image–text pairs with fine-grained evaluations.
📝 Abstract
By combining the natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision-language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently using much cheaper human evaluations instead of full supervision, as well as automated methods. We demonstrate the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we release our human annotation dataset (https://github.com/amazon-science/vigor), comprising 15,440 image and generated-text pairs with fine-grained evaluations, to support related research in the community.
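To make the idea of "fine-grained" reward modeling concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation): each sentence of a generated caption receives a grounding score (e.g., from a cheap human evaluation, +1 for grounded, -1 for hallucinated), and the per-sentence scores are averaged into a single caption-level reward that could then weight a reinforcement-style fine-tuning update. The function name and scoring scheme are illustrative assumptions.

```python
# Hypothetical illustration of fine-grained reward aggregation.
# Assumption: each sentence is scored +1 (grounded) or -1 (hallucinated)
# by a low-cost human evaluation; the caption-level reward is their mean.
from typing import List


def aggregate_reward(sentence_scores: List[float]) -> float:
    """Average per-sentence grounding scores into one caption-level reward."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)


# Example: three grounded sentences and one hallucinated one.
scores = [1.0, 1.0, 1.0, -1.0]
reward = aggregate_reward(scores)  # 0.5
```

A sentence-level reward like this penalizes a single hallucinated claim without discarding an otherwise accurate caption, which is the intuition behind using fine-grained rather than caption-level feedback.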