Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Continual learning with pretrained multimodal models (e.g., CLIP) suffers from catastrophic forgetting, and existing methods often rely on complex architectural modifications. Method: We propose Textual Prototype-guided Prompt Tuning (TPPT), a lightweight prompt-based adaptation framework for CLIP. It introduces learnable, stable textual prototypes as dynamic anchors that guide the task-adaptive optimization of visual prompts (TPPT-V), and jointly tunes visual and textual prompts (TPPT-VT) using bidirectional supervision and a relational diversity regularization to prevent embedding-space collapse and preserve cross-modal associations. Contribution/Results: TPPT fine-tunes only a small set of prompt parameters, fully leveraging CLIP's inherent textual stability and vision-language alignment. On multiple standard continual learning benchmarks, it significantly outperforms state-of-the-art methods in both acquiring new tasks and retaining old ones, demonstrating the effectiveness of the "text-guides-vision" paradigm for continual multimodal learning.

📝 Abstract
Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, providing rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incremental procedures, which introduce additional, and possibly unnecessary, complexity and underutilize CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors that guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding-space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
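As a concrete illustration of the "textual anchors guide visual prompts" idea, here is a minimal NumPy sketch of a bidirectional, prototype-anchored contrastive loss. All names and the temperature value are illustrative assumptions, not the authors' implementation; in TPPT the visual embeddings would come from a prompt-tuned CLIP image encoder and the prototypes from its text encoder.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_anchor_loss(img_emb, text_protos, labels, tau=0.07):
    """Illustrative bidirectional supervision (not the paper's exact loss).

    img_emb:     (B, D) visual embeddings from a prompt-tuned encoder
    text_protos: (C, D) learnable textual prototypes (stable anchors)
    labels:      (B,)   class indices in [0, C)
    """
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    logits = v @ t.T / tau                  # (B, C) scaled cosine similarities
    rows = np.arange(len(labels))
    # image -> text: classify each image against the textual anchors
    loss_i2t = -log_softmax(logits, axis=1)[rows, labels].mean()
    # text -> image: each anchor competes for the images of its class
    loss_t2i = -log_softmax(logits, axis=0)[rows, labels].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Only the prompt parameters that produce `img_emb` and `text_protos` would receive gradients; the frozen CLIP backbone stays fixed, which is what keeps this family of methods lightweight.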
Problem

Research questions and friction points this paper is trying to address.

Enables CLIP to learn new tasks without forgetting old ones
Reduces complexity by leveraging CLIP's intrinsic multi-modal capabilities
Improves continual learning via bidirectional visual-textual prompt tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incremental prompt tuning for CLIP
Textual prototypes guide visual prompts
Bidirectional supervision reduces forgetting
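The anti-collapse idea behind the relational diversity regularization can be sketched as a simple penalty on the textual anchors: if prototypes drift toward each other during incremental tuning, their pairwise cosine similarities grow, and penalizing those similarities keeps the anchors spread out. This is a minimal sketch under the assumption that the regularizer acts on the (C, D) prototype matrix; the paper's exact relational formulation may differ.

```python
import numpy as np

def anchor_diversity_penalty(text_protos):
    """Mean squared off-diagonal cosine similarity between textual
    prototypes; higher values mean more correlated (collapsing) anchors."""
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    sim = t @ t.T                          # (C, C) cosine similarity matrix
    off_diag = sim[~np.eye(len(sim), dtype=bool)]
    return (off_diag ** 2).mean()
```

Adding a small multiple of such a penalty to the anchor-guided loss discourages new-task prototypes from collapsing onto old ones, which is one way correlated forgetting across tasks can be mitigated.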