Feature Projection Learning for Better Vision-Language Reasoning

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for efficiently adapting vision-language models like CLIP to downstream tasks often suffer from limited performance, excessive parameter counts, or high training costs. This work proposes Feature Projection Learning (FPL), which reformulates classification as a feature projection and reconstruction problem: a lightweight projection module maps class prototypes into the query image feature space, and category scores are derived from the negative mean squared reconstruction error, combined with the original CLIP logits for final prediction. By leveraging this novel formulation, FPL achieves state-of-the-art performance across multiple downstream tasks while maintaining an extremely low number of learnable parameters and minimal training overhead, thereby offering superior adaptation accuracy and efficiency.

📝 Abstract
Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods either suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called \textit{\textbf{F}eature \textbf{P}rojection \textbf{L}earning (FPL)} to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
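The scoring scheme described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it assumes a single linear projection matrix `W` and a scalar fusion weight `alpha`, both of which are illustrative choices; names and shapes are likewise hypothetical.

```python
import numpy as np

def fpl_scores(query_feat, prototypes, W, clip_logits, alpha=1.0):
    """Illustrative FPL-style class scoring (assumed linear projection).

    query_feat:  (D,)   query image feature
    prototypes:  (C, D) one prototype feature per class
    W:           (D, D) learnable projection into the query feature space
    clip_logits: (C,)   logits from the frozen pre-trained CLIP
    """
    # Project each class prototype into the query image feature space.
    projected = prototypes @ W                              # (C, D)
    # Negative mean squared reconstruction error per class:
    # the better a projected prototype reconstructs the query
    # feature, the higher (less negative) the class score.
    proj_scores = -((projected - query_feat) ** 2).mean(axis=1)  # (C,)
    # Fuse with the original CLIP prediction (alpha is a guessed knob).
    return clip_logits + alpha * proj_scores
```

Under this reading, a class whose projected prototype reconstructs the query feature exactly gets the maximal score of zero from the projection branch, so classification reduces to finding the best-reconstructing prototype.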
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Pre-Trained models
CLIP adaptation
downstream tasks
limited supervision
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature Projection Learning
CLIP adaptation
prototype projection
feature reconstruction
vision-language reasoning
Yi Zhang
College of Computer Science and Software Engineering, Shenzhen University, China
Weicheng Lin
College of Computer Science and Software Engineering, Shenzhen University, China
Liang-Jie Zhang
Distinguished Professor@Shenzhen University (SZU), ACM DS & IEEE Fellow, ex-RSM@IBM & ex-CTO@Kingdee
Services Computing, AI, Blockchain & IoT, SOA & Cloud Computing, Digital Transformation