AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Handcrafted visual prompts for Large Vision-Language Models (LVLMs) suffer from low efficiency, poor generalization, and suboptimal performance. Method: We propose the first ranking-supervised automatic visual prompt retrieval framework. Our approach establishes an end-to-end pipeline—prompt generation, lightweight quality evaluation using a pre-trained LVLM, and automatic relevance annotation—followed by learning-to-rank to train a plug-and-play lightweight retriever. Contribution/Results: Crucially, we formulate visual prompt optimization as a ranking task, eliminating reliance on manual annotations or model fine-tuning. Experiments demonstrate consistent performance gains across diverse LVLMs: +1.7% accuracy on LLaVA^Wild for LLaVA-OV and +1.9% on MMMU for Qwen2.5-VL, validating both effectiveness and cross-model generalizability.

📝 Abstract
Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 1.7% accuracy gain on LLaVA^Wild, and AutoV boosts Qwen2.5-VL by 1.9% on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
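The abstract's training pipeline, ranking candidate visual prompts by a pre-trained LVLM's prediction loss and using that ranking to supervise a retriever, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the choice of a pairwise margin ranking loss are assumptions made for the example.

```python
import numpy as np

def rank_prompts_by_loss(losses):
    """Automatic relevance annotation (sketch): candidate visual prompts
    are ordered by the LVLM's prediction loss, lower loss = better prompt.
    Returns candidate indices from best to worst."""
    return list(np.argsort(losses))

def pairwise_ranking_loss(scores, ranking, margin=1.0):
    """Learning-to-rank objective (one common choice, assumed here):
    for every (better, worse) pair in the loss-derived ranking, penalize
    the retriever when it does not score the better prompt at least
    `margin` above the worse one."""
    total, pairs = 0.0, 0
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            better, worse = ranking[i], ranking[j]
            total += max(0.0, margin - (scores[better] - scores[worse]))
            pairs += 1
    return total / pairs

# Example: three candidate prompts evaluated by the LVLM.
lvlm_losses = [0.9, 0.2, 0.5]          # prompt 1 is best, prompt 0 is worst
ranking = rank_prompts_by_loss(lvlm_losses)   # [1, 2, 0]
# A retriever whose scores agree with the ranking incurs zero loss.
aligned_scores = [0.0, 2.0, 1.0]
print(pairwise_ranking_loss(aligned_scores, ranking))  # 0.0
```

In the actual system the scores would come from a lightweight learned retriever over (image, query, prompt) features, and the loss would be minimized by gradient descent; the sketch only shows the shape of the ranking supervision.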
Problem

Research questions and friction points this paper is trying to address.

Automating optimal visual prompt selection for LVLMs
Overcoming manual design challenges in visual prompting
Enhancing LVLM performance across image understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically selects optimal visual prompts
Uses ranking-based supervision for training
Enhances LVLM performance across tasks
Authors
Yuan Zhang (School of Computer Science, Peking University)
Chun-Kai Fan (Peking University)
Tao Huang (Shanghai Jiao Tong University)
Ming Lu (School of Computer Science, Peking University)
Sicheng Yu (ByteDance)
Junwen Pan (ByteDance). Research interests: Deep Learning, Machine Learning, Image Segmentation
Kuan Cheng (Peking University). Research interests: Theory of Computation, Pseudorandomness, Coding Theory, Artificial Intelligence
Qi She (ByteDance)
Shanghang Zhang (Peking University). Research interests: Embodied AI, Foundation Models