🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination, severely undermining their reliability. To address this, we propose a black-box visual prompt engineering method that dynamically selects optimal visual prompts (e.g., bounding boxes or circles) to suppress hallucination, without requiring access to model internals, gradient computation, or fine-tuning. Our key contribution is the first model-agnostic visual prompt routing mechanism: a lightweight router adaptively selects the most suitable prompt from a candidate pool, enabling seamless integration with both open-source and closed-source LVLMs. Evaluated on the POPE and CHAIR benchmarks, our approach significantly reduces object hallucination while improving response accuracy and cross-model robustness, offering an efficient, plug-and-play solution for more trustworthy reasoning in LVLMs.
📝 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., a bounding box or circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework that identifies optimal VPs to enhance LVLM responses without requiring access to model internals. Our approach maintains a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
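The abstract's pipeline (overlay a candidate VP on the image, then let a router pick the best one) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `toy_scorer` is a hypothetical stand-in for the trained router model, and the box coordinates are arbitrary.

```python
from PIL import Image, ImageDraw

def apply_visual_prompt(image, box, style):
    """Overlay one candidate visual prompt (bounding box or circle) on a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    if style == "bounding_box":
        draw.rectangle(box, outline="red", width=3)
    elif style == "circle":
        draw.ellipse(box, outline="red", width=3)  # ellipse inscribed in the box
    else:
        raise ValueError(f"unknown visual prompt style: {style}")
    return out

def route_visual_prompt(image, box, candidates, scorer):
    """Render every candidate VP and return the one the scorer ranks highest.
    In BBVPE the scorer would be a learned router; here it is any callable."""
    prompted = {style: apply_visual_prompt(image, box, style) for style in candidates}
    best = max(candidates, key=lambda style: scorer(prompted[style]))
    return best, prompted[best]

# Demo with a blank image and a placeholder scorer (hypothetical, for illustration only).
img = Image.new("RGB", (224, 224), "white")

def toy_scorer(prompted_image):
    # Stand-in for the learned router's score; always prefers the first candidate.
    return 0.0

best_style, prompted_img = route_visual_prompt(
    img, box=(50, 50, 150, 150),
    candidates=["bounding_box", "circle"],
    scorer=toy_scorer,
)
print(best_style)  # the selected VP; the prompted image is then sent to the LVLM
```

The prompted image, rather than the raw one, is what gets passed to the (open- or closed-source) LVLM, which is why no gradients or internal activations are needed.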