CLIP-Adapter: Better Vision-Language Models with Feature Adapters

📅 2021-10-09
🏛️ International Journal of Computer Vision
📈 Citations: 897
Influential citations: 126

🤖 AI Summary
Transferring vision-language models to downstream tasks usually depends on careful prompt engineering. This paper proposes CLIP-Adapter, a parameter-efficient fine-tuning method that adds lightweight feature adapters to either the visual or the textual branch of pretrained CLIP. Each adapter is a bottleneck layer whose output is blended with the original pretrained features in a residual style. Unlike prompt-tuning methods such as CoOp, which learn continuous prompt vectors for the text input, CLIP-Adapter adapts features and leaves the prompts unchanged. The authors report that this simple design outperforms context optimization (CoOp) on a range of visual classification tasks, and that ablation studies support the design.
📝 Abstract
Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced by Radford et al. (2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
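The bottleneck layer and residual-style blending described in the abstract can be written compactly. The formulation below is a sketch under assumed notation, not an equation quoted from the abstract: f is a pretrained feature vector of dimension d, W_1 and W_2 are the hypothetical down- and up-projection weights of the bottleneck with inner dimension r < d, and alpha in [0, 1] is an assumed blending ratio.

```latex
\begin{aligned}
A(f) &= \mathrm{ReLU}(f W_1)\, W_2,
  \qquad W_1 \in \mathbb{R}^{d \times r},\; W_2 \in \mathbb{R}^{r \times d},\; r < d, \\
f^{\star} &= \alpha\, A(f) + (1 - \alpha)\, f .
\end{aligned}
```

With alpha = 0 the blended feature equals the original pretrained feature, so under this reading the ratio controls how far the adapted features may move away from the pretrained ones.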
Problem

Research questions and friction points this paper is trying to address.

Improving vision-language models without prompt engineering
Fine-tuning either the visual or the language branch with feature adapters
Enhancing CLIP performance with residual feature blending
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature adapters for vision-language fine-tuning
Residual-style blending with original features
Bottleneck layer learning new features
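
The three ideas listed above can be combined in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: random arrays stand in for frozen CLIP image and text features, and the function names, the reduction factor of 4, the blending ratio alpha = 0.2, and the temperature are all illustrative assumptions. Training of the adapter weights is omitted.

```python
import numpy as np


def init_adapter(dim, reduction=4, seed=0):
    """Create weights for a hypothetical two-layer bottleneck adapter."""
    rng = np.random.default_rng(seed)
    hidden = dim // reduction  # bottleneck width (assumed reduction factor)
    w_down = rng.normal(0.0, 0.02, size=(dim, hidden))
    w_up = rng.normal(0.0, 0.02, size=(hidden, dim))
    return w_down, w_up


def adapt(features, w_down, w_up, alpha=0.2):
    """Residual-style blend of adapted features with the original features."""
    hidden = np.maximum(features @ w_down, 0.0)  # down-projection + ReLU
    new_features = hidden @ w_up                 # up-projection back to dim
    return alpha * new_features + (1.0 - alpha) * features


def classify(image_features, text_features, temperature=0.01):
    """Softmax over cosine similarities between images and class text features."""
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


# Stand-ins for frozen encoder outputs: 2 images, 3 class prompts, 8-dim features.
rng = np.random.default_rng(1)
image_features = rng.normal(size=(2, 8))
text_features = rng.normal(size=(3, 8))

w_down, w_up = init_adapter(dim=8)
adapted = adapt(image_features, w_down, w_up, alpha=0.2)
probs = classify(adapted, text_features)  # one row of class probabilities per image
```

Here the adapter is applied to the visual branch only; applying the same `adapt` call to `text_features` would sketch the language-branch variant. Setting `alpha=0.0` returns the original features unchanged, which makes the pretrained model the starting point of the adaptation.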