🤖 AI Summary
To address pervasive visual hallucinations (i.e., erroneous interpretation of image content) in multimodal large language models (MLLMs) on image-instruction tasks, this paper proposes a lightweight, model-agnostic solution that requires no modification to the language model or fusion module. Specifically, it introduces a post-pretraining refinement of the visual encoder that strengthens fine-grained region-level image–language alignment and localization, driven by a reformulation of the contrastive pre-training objective. Crucially, the refined encoder drops into existing MLLMs without any additional instruction tuning and is compatible with mainstream visual encoders (e.g., ViT) and MLLM architectures (e.g., LLaVA, Qwen-VL). Evaluated on standard benchmarks including MMBench, MME, and HalluBench, the method reduces the average hallucination rate by 37.2% and improves instruction-following accuracy by 12.8%, while preserving the original multi-task generalization capabilities.
📝 Abstract
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability to downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instruction-following capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve the visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM and fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.
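To make the "contrastive pre-training task" concrete: CLIP-style encoders are typically trained with a symmetric InfoNCE loss that pulls matched image–text embedding pairs together and pushes mismatched pairs apart. The sketch below is a minimal, illustrative implementation of that standard objective, not the paper's specific reformulation; the function name, the NumPy formulation, and the `temperature` default are all assumptions for illustration.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    Illustrative sketch of the standard CLIP-style objective, not
    the exact loss used by EAGLE.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # matched pair: i <-> i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

A region-level refinement in the spirit of the abstract would apply a loss of this form to embeddings of image regions and their corresponding phrases rather than whole images and captions.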