🤖 AI Summary
Large vision-language models (LVLMs) often produce text that is poorly grounded in the visual input, manifesting as hallucinated scene elements, omission of salient parts of the scene, and incorrect inferences about object attributes and relationships. To address this, the authors propose ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a framework that uses fine-grained reward modeling to substantially improve the visual grounding of LVLMs over pre-trained baselines. The approach relies on much cheaper human evaluations rather than full supervision, complemented by automated methods. Experiments across a variety of evaluation methods and benchmarks demonstrate the effectiveness of the approach. The authors also release their human annotation dataset of 15,440 image–text pairs with fine-grained evaluations.
📝 Abstract
By combining the natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision-language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently using much cheaper human evaluations instead of full supervision, as well as automated methods. We demonstrate the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we release our human annotation dataset (https://github.com/amazon-science/vigor), comprising 15,440 image and generated-text pairs with fine-grained evaluations, to support related research in the community.
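To make the idea of "fine-grained" reward modeling concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation): each sentence of a generated caption receives a grounding score (e.g., from a cheap human evaluation, +1 for grounded, -1 for hallucinated), and the per-sentence scores are averaged into a single caption-level reward that could then weight a reinforcement-style fine-tuning update. The function name and scoring scheme are illustrative assumptions.

```python
# Hypothetical illustration of fine-grained reward aggregation.
# Assumption: each sentence is scored +1 (grounded) or -1 (hallucinated)
# by a low-cost human evaluation; the caption-level reward is their mean.
from typing import List


def aggregate_reward(sentence_scores: List[float]) -> float:
    """Average per-sentence grounding scores into one caption-level reward."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)


# Example: three grounded sentences and one hallucinated one.
scores = [1.0, 1.0, 1.0, -1.0]
reward = aggregate_reward(scores)  # 0.5
```

A sentence-level reward like this penalizes a single hallucinated claim without discarding an otherwise accurate caption, which is the intuition behind using fine-grained rather than caption-level feedback.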