ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models

📅 2025-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the excessive query cost of existing black-box prompt tuning methods for vision-language models (VLMs) in query-constrained real-world scenarios, this paper proposes Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP). ZIP reduces the optimization dimensionality by re-parameterizing prompts in low-rank representations, and it introduces intrinsic-dimensional clipping of the estimated gradients, which needs no manually selected threshold, to suppress the variance of zeroth-order gradient estimates. Evaluated on 13+ vision-language tasks, ZIP achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency over the best-performing alternative black-box prompt tuning methods, establishing a new state of the art. The core idea is to combine a low intrinsic-dimensional parameterization with zeroth-order optimization so that prompt tuning stays accurate under tight query budgets.

📝 Abstract

Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT). While BBPT has demonstrated considerable potential, many existing methods require an excessive number of queries (i.e., function evaluations), which poses a significant challenge in real-world scenarios where the number of allowed queries is limited. To tackle this issue, we propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a novel approach that enables efficient and robust prompt optimization in a purely black-box setting. The key idea of ZIP is to reduce the problem dimensionality and the variance of zeroth-order gradient estimates, so that training is fast and requires far fewer queries. We achieve this by re-parameterizing prompts in low-rank representations and designing intrinsic-dimensional clipping of estimated gradients. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal, without the need to manually select the clipping threshold, matching the result of expensive hyperparameter search.
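
To make the query accounting concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate with norm clipping, run on a toy quadratic that stands in for the black-box model. It is not the authors' implementation: the names `zo_gradient`, `clip_by_norm`, `tune`, and `CountingLoss` are hypothetical, and the threshold `sqrt(d)` (with `d` the number of optimized parameters) is an assumption chosen only to show a threshold derived from dimensionality instead of hand-tuned.

```python
import math
import random


def zo_gradient(loss, theta, mu=1e-3, rng=random):
    """Two-point zeroth-order estimate; costs exactly two queries to `loss`."""
    # Sample a random Gaussian direction with the same length as theta.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + mu * ui for t, ui in zip(theta, u)]
    minus = [t - mu * ui for t, ui in zip(theta, u)]
    # Finite difference along u approximates the directional derivative.
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def clip_by_norm(grad, threshold):
    """Rescale grad so its Euclidean norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]


def tune(loss, theta, steps=300, lr=0.05, seed=0):
    """Plain gradient descent driven only by function evaluations."""
    rng = random.Random(seed)
    # Assumption for illustration: threshold set from the dimensionality.
    threshold = math.sqrt(len(theta))
    for _ in range(steps):
        grad = clip_by_norm(zo_gradient(loss, theta, rng=rng), threshold)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


class CountingLoss:
    """Toy black-box objective that counts how often it is queried."""

    def __init__(self, target):
        self.target = target
        self.queries = 0

    def __call__(self, theta):
        self.queries += 1
        return sum((t - c) ** 2 for t, c in zip(theta, self.target))


loss = CountingLoss([1.0, -2.0, 0.5, 3.0])
tuned = tune(loss, [0.0, 0.0, 0.0, 0.0])
```

Each optimization step spends two queries regardless of the number of parameters, but the variance of the estimate grows with dimensionality, which is why lowering the dimensionality and clipping the estimate both matter when the query budget is fixed.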

Problem

Research questions and friction points this paper is trying to address.

Efficient prompt-tuning for black-box vision-language models

Reducing excessive queries in black-box prompt optimization

Improving accuracy and query efficiency in few-shot tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank prompt re-parameterization for efficiency

Intrinsic-dimensional gradient clipping for robustness

Zeroth-order optimization with reduced query usage
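
The low-rank re-parameterization listed above can be sketched as a plain factorization of the prompt matrix into two thin factors. This is an illustrative assumption, not the exact parameterization used in the paper: the function names and the sizes (4 prompt tokens, 512 embedding dimensions, rank 2) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_factors(num_tokens, embed_dim, rank, seed=0):
    """Create factors U (num_tokens x rank) and V (rank x embed_dim)."""
    rng = random.Random(seed)
    u = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(num_tokens)]
    v = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(rank)]
    return u, v


def build_prompt(u, v):
    """Expand the factors into the full prompt matrix sent to the model."""
    return matmul(u, v)


def num_trainable(u, v):
    """Count the parameters the zeroth-order optimizer has to search over."""
    return sum(len(row) for row in u) + sum(len(row) for row in v)


u, v = init_factors(num_tokens=4, embed_dim=512, rank=2)
prompt = build_prompt(u, v)          # full 4 x 512 prompt
low_rank_params = num_trainable(u, v)  # 2 * (4 + 512) = 1032
full_params = 4 * 512                  # 2048
```

The optimizer perturbs only the factors, so the search space here shrinks from 2048 to 1032 values while the black-box model still receives a full-size prompt.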

🔎 Similar Papers

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models