GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

📅 2025-11-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Fine-tuning video-language models often degrades generalization to unseen categories, while existing prompt tuning methods compromise the learning capacity of soft prompts when mitigating catastrophic forgetting. To address this, we propose a plug-and-play coupled prompt learning framework: it integrates pretrained hard prompts with learnable mapping layers for soft prompts, and constructs generic semantic anchors from irrelevant video clips and negative prompts to alleviate semantic-space collapse. Our method jointly optimizes text- and vision-modality soft prompts, transfers hard prompts across datasets, designs negative-sample prompts, and refines the mapping network, thereby balancing learnability and generalization. Experiments on multiple video understanding benchmarks demonstrate that our approach significantly outperforms state-of-the-art prompt tuning methods, especially in zero-shot and few-shot transfer from base to novel classes.

๐Ÿ“ Abstract
Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
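The coupling mechanism in the abstract can be illustrated with a minimal PyTorch-style sketch. This is a hypothetical reconstruction, not the paper's released code: the module name `CoupledPromptLearner`, the token counts, and the choice of a single linear layer as the "learnable mapping layer" are all assumptions; the sketch only shows how frozen hard prompt tokens (pretrained on another dataset) and learnable soft prompt tokens could be coupled and prepended to class-token embeddings.

```python
import torch
import torch.nn as nn

class CoupledPromptLearner(nn.Module):
    """Hypothetical sketch of the coupled prompt: frozen hard prompt tokens
    are concatenated with learnable soft prompt tokens, passed through a
    learnable mapping layer, and prepended to each class's token embeddings."""

    def __init__(self, hard_prompt: torch.Tensor, n_soft: int = 8, dim: int = 512):
        super().__init__()
        # Hard prompt: pretrained on another dataset, kept frozen (a buffer, not a parameter).
        self.register_buffer("hard_prompt", hard_prompt)        # (n_hard, dim)
        # Soft prompt: freely learnable context vectors.
        self.soft_prompt = nn.Parameter(torch.randn(n_soft, dim) * 0.02)
        # Learnable mapping layer coupling hard and soft tokens.
        self.mapping = nn.Linear(dim, dim)

    def forward(self, class_tokens: torch.Tensor) -> torch.Tensor:
        # class_tokens: (n_classes, n_cls_tokens, dim) class-name embeddings.
        coupled = self.mapping(torch.cat([self.hard_prompt, self.soft_prompt], dim=0))
        ctx = coupled.unsqueeze(0).expand(class_tokens.size(0), -1, -1)
        # Full prompt per class: [mapped hard+soft context ; class tokens].
        return torch.cat([ctx, class_tokens], dim=1)

hard = torch.randn(4, 512)             # e.g. 4 pretrained hard prompt tokens
learner = CoupledPromptLearner(hard)
cls = torch.randn(10, 3, 512)          # 10 classes, 3 name tokens each
prompts = learner(cls)
print(prompts.shape)                   # torch.Size([10, 15, 512])
```

Because the hard prompt is registered as a buffer, it receives no gradient, so only the soft tokens and the mapping layer adapt during fine-tuning, which is the "competitive prompting" balance the abstract describes.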
Problem

Research questions and friction points this paper is trying to address.

Mitigates semantic space narrowing in video-language models
Introduces generic attribute anchors to preserve generalization
Optimizes prompt tuning to prevent overfitting to supervised categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coupling hard and soft prompts via learnable mapping layer
Using irrelevant video sets as generic attribute anchors
Introducing negative prompts to preserve semantic space generalization
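The anchor idea in the bullets above can be sketched as a regularizer. This is a hedged illustration, not the paper's exact loss: the function name, the L1 form of the penalty, and the use of a frozen copy of the pretrained encoders are assumptions. The intent is only to show how similarities between irrelevant videos and negative prompts could anchor the tuned semantic space to the pretrained one.

```python
import torch
import torch.nn.functional as F

def attribute_anchor_loss(video_feats, neg_prompt_feats,
                          frozen_video_feats, frozen_neg_prompt_feats):
    """Hypothetical anchor regularizer: keep the tuned model's similarity
    pattern between irrelevant videos and negative prompts close to the
    frozen pretrained model's pattern, discouraging semantic-space collapse.
    All inputs are (num_items, dim) feature matrices."""
    # Cosine-similarity maps from the tuned and the frozen encoders.
    sim_tuned = F.normalize(video_feats, dim=-1) @ F.normalize(neg_prompt_feats, dim=-1).T
    sim_frozen = F.normalize(frozen_video_feats, dim=-1) @ F.normalize(frozen_neg_prompt_feats, dim=-1).T
    # L1 distance between the two maps; the frozen side carries no gradient.
    return (sim_tuned - sim_frozen.detach()).abs().mean()

v = torch.randn(6, 512)   # 6 irrelevant video-clip features
p = torch.randn(5, 512)   # 5 negative-prompt features
loss = attribute_anchor_loss(v, p, v.clone(), p.clone())
print(loss.item())        # 0.0 when tuned and frozen features coincide
```

Added to the task loss with a small weight, such a term would penalize the tuned encoders only insofar as they distort the pretrained model's generic attribute relevances, matching the "generic attribute anchor" motivation.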
Bin Wang
Shandong University of Technology
Ruotong Hu
Shandong University of Technology
Wenqian Wang
Singapore University of Technology and Design; Nanyang Technological University; Shandong University
Computer Vision · Anomaly Detection · Action Recognition · Action Prediction
Wentong Li
Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Vision-Language Model · Robotics
Mingliang Gao
Shandong University of Technology
Runmin Cong
Shandong University
Wei Zhang
Shandong University