🤖 AI Summary
Existing methods for efficiently adapting vision-language models like CLIP to downstream tasks often suffer from limited performance, excessive parameter counts, or high training costs. This work proposes Feature Projection Learning (FPL), which reformulates classification as a feature projection and reconstruction problem: a lightweight projection module maps class prototypes into the query image feature space, and category scores are derived from the negative mean squared reconstruction error, then combined with the original CLIP logits for the final prediction. With this formulation, FPL achieves state-of-the-art performance across multiple downstream tasks while keeping the number of learnable parameters and the training overhead low.
📝 Abstract
Vision-Language Pre-trained (VLP) models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called \textit{\textbf{F}eature \textbf{P}rojection \textbf{L}earning (FPL)} to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative mean squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing current state-of-the-art methods by a substantial margin.
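The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single shared linear projection matrix as the "projection model", treats the query feature as a single vector rather than a full feature map, and uses a hypothetical mixing weight `alpha` for combining the two predictions.

```python
import numpy as np

def fpl_scores(query_feat, prototypes, proj, clip_logits, alpha=1.0):
    """Sketch of FPL-style scoring (simplified; names are illustrative).

    query_feat  : (d,)   query image feature from the CLIP image encoder
    prototypes  : (C, d) class prototype features (e.g. text embeddings)
    proj        : (d, d) learnable projection into the image feature space
    clip_logits : (C,)   original zero-shot CLIP logits
    alpha       : scalar weight for the projection-model prediction
    """
    # Project each class prototype into the query image feature space,
    # yielding one reconstruction of the query feature per class.
    recon = prototypes @ proj                        # (C, d)

    # Negative mean squared reconstruction error serves as the class score:
    # the class whose projected prototype reconstructs the query best wins.
    proj_logits = -((recon - query_feat) ** 2).mean(axis=1)  # (C,)

    # Final prediction combines the projection model with pre-trained CLIP.
    return clip_logits + alpha * proj_logits
```

With an identity projection and prototypes `[[1, 0], [0, 1]]`, a query feature `[1, 0]` reconstructs perfectly from class 0 (error 0) and poorly from class 1, so class 0 receives the highest score.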