Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

📅 2024-04-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language models (e.g., CLIP) have limited semantic expressivity because their prompt templates, whether hand-crafted or learned, lack diversity, which hurts generalization and can cause incorrect predictions in downstream tasks. To address this, the authors propose CoKnow, a contextual prompting framework driven by multi-knowledge representation. CoKnow introduces lightweight semantic knowledge mappers that automatically generate diverse, heterogeneous knowledge representations (e.g., attributes, relations, scenes) directly from input images, without requiring additional priors. These knowledge representations then supply the contextual knowledge used to optimize the prompts. Evaluated on 11 public benchmarks, CoKnow outperforms a series of previous methods. The authors state that all resources will be released as open source.

📝 Abstract
Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which are capable of generating Multi-Knowledge Representation for an input image without requiring additional priors. We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.
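
To make the "diversity in prompt templates" point concrete, the following is a minimal sketch of generic CLIP-style classification with several prompt variants per class, whose text embeddings are averaged into one class prototype. It is not CoKnow's method and does not reproduce its knowledge mappers. The helper names, the toy three-dimensional embeddings, and the temperature value are all assumptions made for illustration; a real system would obtain the embeddings from a VLM's image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def class_prototype(prompt_embeddings):
    """Average the normalized embeddings of several prompts for one class."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify(image_embedding, prompts_per_class, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities to each class prototype."""
    logits = [
        cosine(image_embedding, class_prototype(prompts)) / temperature
        for prompts in prompts_per_class
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical, hand-written embeddings standing in for encoder outputs.
prompts_per_class = [
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],  # class 0: two prompt variants
    [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]],  # class 1: two prompt variants
]
image_embedding = [0.8, 0.2, 0.1]

probs = classify(image_embedding, prompts_per_class)
predicted_class = probs.index(max(probs))
```

Averaging over prompt variants is only the simplest way to combine several textual descriptions of a class. According to the abstract, CoKnow instead generates its Multi-Knowledge Representation from the input image with trained mappers, which this sketch does not attempt.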
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse prompt templates limits VLM performance
Insufficient contextual knowledge causes incorrect downstream predictions
Need to enhance prompt learning with multi-knowledge representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-knowledge representation enhances prompt learning
Lightweight semantic mappers generate contextual knowledge
Framework improves vision-language model adaptation capabilities
Enming Zhang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Bingke Zhu
Institute of Automation, Chinese Academy of Sciences
Yingying Chen
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Qinghai Miao
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China