Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination, which severely undermines their reliability. To address this, we propose a black-box visual prompt engineering method that dynamically selects the optimal visual prompt (e.g., a bounding box or circle) to suppress hallucination, without requiring access to model internals, gradient computation, or fine-tuning. Our key contribution is the first model-agnostic visual prompt routing mechanism: a lightweight router adaptively selects the most suitable prompt from a candidate pool, enabling seamless integration with both open- and closed-source LVLMs. Evaluated on the POPE and CHAIR benchmarks, our approach significantly reduces object hallucination while improving response accuracy and cross-model robustness, offering an efficient, plug-and-play solution for more trustworthy reasoning in LVLMs.

📝 Abstract
Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., bounding box, circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
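The abstract's core operation, overlaying a visual cue such as a bounding box on the input image before querying the LVLM, can be sketched as follows. This is a minimal illustration on a grayscale pixel grid; the function name and pixel values are hypothetical and not from the paper:

```python
def overlay_bounding_box(image, top, left, bottom, right, value=255):
    """Overlay a rectangle outline (a bounding-box visual prompt) on a
    grayscale image represented as a list of lists of pixel intensities."""
    boxed = [row[:] for row in image]  # copy so the original image is untouched
    for col in range(left, right + 1):
        boxed[top][col] = value        # top edge
        boxed[bottom][col] = value     # bottom edge
    for row in range(top, bottom + 1):
        boxed[row][left] = value       # left edge
        boxed[row][right] = value      # right edge
    return boxed

# 5x5 all-zero image; draw a box outlining the central 3x3 region
image = [[0] * 5 for _ in range(5)]
boxed = overlay_bounding_box(image, top=1, left=1, bottom=3, right=3)
```

In practice the prompted image (`boxed`), rather than the raw image, is what would be sent to the LVLM alongside the text query.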
Problem

Research questions and friction points this paper is trying to address.

Mitigating object hallucination in Large Vision Language Models
Identifying optimal visual prompts without model internals
Reducing hallucination via dynamic visual prompt selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses visual cues like bounding boxes to reduce hallucinations
Trains router model to select optimal visual prompts
Black-box approach works with any large vision language model
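The routing step in the bullets above can be sketched as a selector over a pool of candidate visual prompt types. In the paper the router is a trained model; the feature-based `toy_scorer` below is a purely hypothetical stand-in used only to make the sketch runnable:

```python
def route_visual_prompt(image_features, candidates, scorer):
    """Return the candidate visual prompt with the highest router score.

    `scorer` stands in for the trained router model from the paper: any
    callable mapping (image_features, candidate) -> float works here.
    """
    return max(candidates, key=lambda vp: scorer(image_features, vp))

# Hypothetical candidate pool and scoring heuristic (not from the paper):
# prefer a circle prompt when the scene is cluttered, a bounding box otherwise.
candidates = ["bounding_box", "circle", "contour"]

def toy_scorer(features, vp):
    base = {"bounding_box": 0.6, "circle": 0.5, "contour": 0.4}[vp]
    bonus = 0.3 * features["clutter"] if vp == "circle" else 0.0
    return base + bonus

chosen = route_visual_prompt({"clutter": 0.9}, candidates, toy_scorer)
```

Because the router only needs the image (and the LVLM's responses during training), not gradients or internal activations, the same selection loop applies unchanged to proprietary, API-only models.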