🤖 AI Summary
Existing vision-language models (e.g., CLIP) are limited in semantic expressivity by prompt templates that lack diversity, whether hand-crafted or learned, which restricts generalization and can cause erroneous predictions in downstream tasks. To address this, we propose CoKnow, a contextual prompting framework driven by multi-knowledge representation. CoKnow trains lightweight semantic knowledge mappers that generate diverse knowledge representations (e.g., attributes, relations, scenes) directly from an input image, without requiring additional priors, and uses these representations to enrich the context during prompt optimization. Evaluated on 11 public benchmarks, CoKnow outperforms a series of previous methods. The code and resources will be made publicly available.
📝 Abstract
Vision-Language Models (VLMs) such as CLIP play a foundational role in a wide range of cross-modal applications. Context optimization methods such as Prompt Tuning are essential for adapting VLMs to downstream tasks. One key limitation, however, is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can lead to incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To support CoKnow at inference time, we train lightweight semantic knowledge mappers that generate a Multi-Knowledge Representation for an input image without requiring additional priors. Extensive experiments on 11 publicly available datasets demonstrate that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.
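To make the general idea concrete, the following is a minimal toy sketch of image-conditioned context optimization. It is not the CoKnow implementation: the abstract does not specify the architecture, so every name here (`class_probs`, `context`, `mapper_W`) is hypothetical. The text encoder is replaced by simple vector addition and the "knowledge mapper" by a single linear map, purely for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale to unit length; leave an all-zero vector unchanged.
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def matvec(W, v):
    return [dot(row, v) for row in W]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def class_probs(image_feat, class_embs, context, mapper_W, temperature=0.07):
    """Toy CLIP-style classifier with an image-conditioned prompt.

    ASSUMPTION: a real text encoder is replaced by vector addition, so the
    "text feature" is class embedding + learnable context + mapper(image).
    """
    knowledge = matvec(mapper_W, image_feat)  # hypothetical knowledge vector
    img = normalize(image_feat)
    logits = []
    for c in class_embs:
        text = normalize([ci + xi + ki for ci, xi, ki in zip(c, context, knowledge)])
        logits.append(dot(img, text) / temperature)  # scaled cosine similarity
    return softmax(logits)


# Tiny example: 3 classes in a 4-dimensional feature space.
class_embs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
image_feat = [1.0, 0.0, 0.0, 0.0]
context = [0.0] * 4                      # would be learned in prompt tuning
mapper_W = [[0.0] * 4 for _ in range(4)]  # would be a trained mapper

probs = class_probs(image_feat, class_embs, context, mapper_W)
print(probs)
```

In an actual prompt-tuning setup, `context` (and, under this sketch's assumptions, `mapper_W`) would be optimized by gradient descent on a classification loss while the pretrained encoders stay frozen.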