A Retrospect to Multi-prompt Learning across Vision and Language

📅 2025-10-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing vision-language models (VLMs) are predominantly adapted with single-prompt paradigms, and multi-prompt learning has received little systematic investigation. Method: This paper introduces an energy-based multi-prompt learning framework (EMPL), which models the distribution of prompt embeddings with an energy function implicitly defined by the VLM and samples from it to generate diverse, semantically complementary prompt ensembles, enabling parameter-efficient adaptation of VLMs. Contribution/Results: The method is reported to add no inference overhead while improving open-vocabulary generalization, and it balances in-domain accuracy against out-of-domain robustness under cross-domain transfer. Theoretical analysis and extensive experiments indicate that EMPL consistently outperforms mainstream prompt-learning approaches across multiple downstream tasks.

📝 Abstract

The vision community is undergoing unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning is regarded as the holy grail of accessing VLMs, since it enables their fast adaptation to downstream tasks with limited resources. However, existing research centers on single-prompt paradigms and rarely investigates the technical potential of their multi-prompt counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recently observed constant modality gap phenomenon to learnable prompts and then justify, empirically and theoretically, the superiority of vision-language transfer with multi-prompt augmentation. Based on this observation, we propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution that is implicitly defined by VLMs. As a result, EMPL is not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and to demonstrate the effectiveness of EMPL.
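As a rough illustration of what multi-prompt prediction can look like in a CLIP-style model, the sketch below scores each image against several prompt embeddings per class and averages the cosine similarities. The abstract does not say how EMPL combines its prompts, so the averaging rule, the function names, the array shapes, and the `temperature` value are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def multi_prompt_logits(image_feats, prompt_embeds, temperature=0.01):
    """Average class scores over several prompts (hypothetical helper).

    image_feats:   (batch, dim) image features
    prompt_embeds: (num_prompts, num_classes, dim) text-side prompt embeddings
    returns:       (batch, num_classes) logits
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(prompt_embeds)
    # Cosine similarity for every prompt: (num_prompts, batch, num_classes).
    sims = np.einsum("bd,kcd->kbc", img, txt)
    # Assumption: the ensemble is a plain mean over the prompt axis.
    return sims.mean(axis=0) / temperature

# Toy usage: two prompts per class, three classes, three-dimensional features.
prompt_embeds = np.stack([np.eye(3), 2.0 * np.eye(3)])
image_feats = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 5.0]])
logits = multi_prompt_logits(image_feats, prompt_embeds)
predictions = logits.argmax(axis=1)
```

Averaging similarities rather than logits of separately calibrated heads keeps the example parameter-free; other combination rules (for instance, averaging the normalized embeddings first) would fit the same interface.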

Problem

Research questions and friction points this paper is trying to address.

Investigating multi-prompt learning potential in vision-language models

Extending modality gap theory to learnable prompts for transfer

Achieving balance between in-domain and out-of-domain generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends modality gap phenomenon to learnable prompts

Proposes energy-based distribution for prompt generation

Balances in-domain and out-of-domain generalization
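To make the idea of drawing prompt embeddings from an energy-based distribution concrete, here is a minimal sketch using unadjusted Langevin dynamics on a toy energy. The paper's actual energy is implicitly defined by the VLM and is not given on this page, so the quadratic energy E(p) = -<p, f> + 0.5 * lam * ||p||^2, the feature vector `f`, the sampler, and every hyperparameter below are placeholders chosen only so the example runs on its own.

```python
import numpy as np

def energy_grad(p, f, lam):
    # Gradient of the toy energy E(p) = -<p, f> + 0.5 * lam * ||p||^2.
    # Assumption: this stands in for an energy that a VLM would define.
    return -f + lam * p

def sample_prompts(f, num_prompts=4, steps=200, step_size=0.05, lam=1.0, seed=0):
    """Draw prompt embeddings with unadjusted Langevin dynamics (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start every chain from standard Gaussian noise.
    p = rng.standard_normal((num_prompts, f.shape[0]))
    for _ in range(steps):
        noise = rng.standard_normal(p.shape)
        # Gradient step toward low energy plus injected noise.
        p = p - step_size * energy_grad(p, f, lam) + np.sqrt(2.0 * step_size) * noise
    return p

# Toy usage: the samples should scatter around f / lam for this energy.
f = 2.0 * np.ones(8)
prompts = sample_prompts(f, num_prompts=2000)
```

For this particular energy the target distribution is a Gaussian centred at f / lam, which makes the sampler easy to sanity-check; a VLM-defined energy would require automatic differentiation for the gradient.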
