🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination, severely undermining their reliability. To address this, we propose a black-box visual prompt engineering method that dynamically selects optimal visual prompts (e.g., bounding boxes or circles) to suppress hallucination, without requiring access to model internals, gradient computation, or fine-tuning. Our key contribution is the first model-agnostic visual prompt routing mechanism: a lightweight router adaptively selects the most suitable prompt from a candidate pool, enabling seamless integration with both open-source and closed-source LVLMs. Evaluated on the POPE and CHAIR benchmarks, our approach significantly reduces object hallucination while improving response accuracy and cross-model robustness, offering an efficient, plug-and-play solution for more trustworthy reasoning in LVLMs.
📝 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., a bounding box or circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework that identifies optimal VPs to enhance LVLM responses without requiring access to model internals. Our approach maintains a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
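The abstract's pipeline (overlay a candidate VP on the image, then let a router pick the best one) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `toy_scorer` is a hypothetical stand-in for the trained router model, and the box coordinates are arbitrary.

```python
from PIL import Image, ImageDraw

def apply_visual_prompt(image, box, style):
    """Overlay one candidate visual prompt (bounding box or circle) on a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    if style == "bounding_box":
        draw.rectangle(box, outline="red", width=3)
    elif style == "circle":
        draw.ellipse(box, outline="red", width=3)  # ellipse inscribed in the box
    else:
        raise ValueError(f"unknown visual prompt style: {style}")
    return out

def route_visual_prompt(image, box, candidates, scorer):
    """Render every candidate VP and return the one the scorer ranks highest.
    In BBVPE the scorer would be a learned router; here it is any callable."""
    prompted = {style: apply_visual_prompt(image, box, style) for style in candidates}
    best = max(candidates, key=lambda style: scorer(prompted[style]))
    return best, prompted[best]

# Demo with a blank image and a placeholder scorer (hypothetical, for illustration only).
img = Image.new("RGB", (224, 224), "white")

def toy_scorer(prompted_image):
    # Stand-in for the learned router's score; always prefers the first candidate.
    return 0.0

best_style, prompted_img = route_visual_prompt(
    img, box=(50, 50, 150, 150),
    candidates=["bounding_box", "circle"],
    scorer=toy_scorer,
)
print(best_style)  # the selected VP; the prompted image is then sent to the LVLM
```

The prompted image, rather than the raw one, is what gets passed to the (open- or closed-source) LVLM, which is why no gradients or internal activations are needed.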