Draft and Refine with Visual Experts

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) frequently over-rely on linguistic priors while neglecting visual evidence, leading to hallucination and unreliable reasoning. To address this, we propose a "Draft and Refine" framework built around the *Visual Utilization Score* (VUS)—the first metric explicitly defined and quantified to measure how effectively an LVLM leverages visual input. VUS is computed by constructing a question-conditioned relevance map that localizes question-specific cues and then applying relevance-guided probabilistic masking to measure how strongly the model's answer depends on those regions. Guided by this score, the framework renders localization cues from external visual experts (e.g., object detectors or segmentation models) onto the image, re-queries the model, and keeps the response that yields the largest gain in utilization—all without architectural modifications or fine-tuning. Our approach strengthens visual grounding and evidence-driven inference. Extensive experiments on VQA and image captioning demonstrate consistent improvements in accuracy, reduced hallucination, and strong cross-task generalization—validating its effectiveness, interpretability, and broad applicability.
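
A minimal sketch of how such a utilization score could be estimated, assuming a question-conditioned relevance map is already available and the LVLM is wrapped in a user-supplied `score_fn` that returns the log-probability of an answer. The masking strategy, sampling scheme, and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def utilization_score(image, question, answer, relevance, score_fn,
                      n_samples=8, rng=None):
    """Estimate how much `answer` depends on question-relevant pixels.

    image     : H x W x C array.
    relevance : H x W array in [0, 1], question-conditioned relevance map.
    score_fn  : callable(image, question, answer) -> log-prob of the answer
                under the LVLM (a user-supplied wrapper; hypothetical here).
    Returns the average drop in answer log-prob when relevant regions are
    probabilistically masked; larger drops suggest stronger visual reliance.
    """
    rng = rng or np.random.default_rng(0)
    base = score_fn(image, question, answer)
    drops = []
    for _ in range(n_samples):
        # Drop each pixel with probability equal to its relevance.
        keep = rng.random(relevance.shape) > relevance
        masked = image * keep[..., None]
        drops.append(base - score_fn(masked, question, answer))
    return float(np.mean(drops))

# Toy usage with a stand-in scorer that just rewards image brightness.
if __name__ == "__main__":
    img = np.random.rand(32, 32, 3)
    rel = np.zeros((32, 32))
    rel[8:24, 8:24] = 1.0  # fake relevance blob
    dummy = lambda im, q, a: float(im.mean())
    print(utilization_score(img, "what is in the box?", "a cat", rel, dummy))
```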

📝 Abstract
While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.
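
To make the draft-and-refine loop concrete, the sketch below shows one way the selection step described in the abstract could be organized. It assumes user-supplied callables for answering (`lvlm_answer`), scoring utilization (`utilization_score`), and overlaying expert outputs on the image (`render_cue`); the expert interface and the greedy selection rule are illustrative, not the authors' exact implementation.

```python
def draft_and_refine(image, question, lvlm_answer, utilization_score,
                     experts, render_cue):
    """Return the response whose rendered expert cue most improves utilization.

    lvlm_answer(image, question)          -> answer string
    utilization_score(image, question, a) -> scalar visual-utilization estimate
    experts  : iterable of callables image -> localization output (boxes/masks)
    render_cue(image, cue)                -> copy of the image with the cue drawn
    """
    draft = lvlm_answer(image, question)
    base = utilization_score(image, question, draft)
    best_answer, best_gain = draft, 0.0
    for expert in experts:
        cue = expert(image)                            # e.g. detector boxes or masks
        cued_image = render_cue(image, cue)            # overlay the cue as a visual hint
        candidate = lvlm_answer(cued_image, question)  # re-query the LVLM
        gain = utilization_score(cued_image, question, candidate) - base
        if gain > best_gain:                           # keep the largest utilization gain
            best_gain, best_answer = gain, candidate
    return best_answer
```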
Problem

Research questions and friction points this paper is trying to address.

Quantifying visual information reliance in multimodal models to reduce hallucinations
Developing a framework that improves grounding without retraining or architectural changes
Measuring visual utilization to create interpretable, evidence-driven agent systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies visual evidence reliance via relevance-guided masking
Refines responses using external visual experts' rendered cues
Improves grounding without retraining through iterative response selection