🤖 AI Summary
When vision-language foundation models (e.g., CLIP) are upgraded, existing efficient fine-tuning methods—particularly prompt-based approaches—often fail due to shifts in the text encoder’s embedding space.
Method: We propose ContCoOp (Class-conditioned Context Optimization), a cross-version compatible dynamic prompt optimization framework. ContCoOp introduces a conditional context modeling mechanism that lets learnable prompts adaptively align with the embedding distributions of different text encoder versions. Within the CLIP architecture, it fuses class embeddings and learnable prompts through an attention layer to generate class-conditioned, context-aware prompts.
Contribution/Results: Evaluated across 15 benchmark datasets, ContCoOp consistently outperforms state-of-the-art prompt tuning methods, achieving superior cross-version transfer and out-of-distribution generalization. To our knowledge, it is the first systematic solution to the fine-tuning module compatibility problem that arises from iterative vision-language model upgrades.
📝 Abstract
Efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, do these plug-and-play modules remain effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on CLIP in terms of their compatibility with model updates. The study reveals that many high-performing fine-tuning methods lose their effectiveness on upgraded models. To address this, we propose a novel approach, Class-conditioned Context Optimization (ContCoOp), which integrates learnable prompts with class embeddings using an attention layer before feeding them into the text encoder. Consequently, the prompts can dynamically adapt to changes in the embedding space caused by model updates, ensuring continued effectiveness. Extensive experiments over 15 datasets show that ContCoOp achieves the highest compatibility among the baseline methods and exhibits robust out-of-distribution generalization.
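The core mechanism, fusing learnable prompts with a class embedding through an attention layer before the text encoder, can be illustrated with a minimal sketch. This is an assumption-laden toy in NumPy, not the authors' implementation: single-head attention, illustrative shapes, and hypothetical names (`condition_prompts`, the `Wq/Wk/Wv` projections).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def condition_prompts(prompts, class_emb, Wq, Wk, Wv):
    """Hypothetical sketch of class-conditioned context optimization.

    prompts:   (M, d) learnable context vectors
    class_emb: (d,)   embedding of one class name
    Wq/Wk/Wv:  (d, d) attention projections (assumed single-head)

    Each prompt attends over the prompts plus the class embedding,
    so the returned context vectors depend on the class -- and on
    whatever embedding space the current text encoder version uses.
    """
    ctx = np.vstack([prompts, class_emb[None, :]])     # (M+1, d) keys/values
    q = prompts @ Wq                                   # (M, d)
    k = ctx @ Wk                                       # (M+1, d)
    v = ctx @ Wv                                       # (M+1, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (M, M+1)
    return attn @ v                                    # (M, d) conditioned prompts

# Toy usage: 4 context tokens of width 8 for a single class.
rng = np.random.default_rng(0)
d, M = 8, 4
prompts = rng.standard_normal((M, d))
class_emb = rng.standard_normal(d)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
conditioned = condition_prompts(prompts, class_emb, Wq, Wk, Wv)
assert conditioned.shape == (M, d)
```

In an actual pipeline, the conditioned prompts would be prepended to the class token and passed through the (frozen) CLIP text encoder; because the conditioning happens in the encoder's input embedding space, swapping in an upgraded encoder changes `class_emb` and the attention output adapts accordingly.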