Feature Projection Learning for Better Vision-Language Reasoning

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for efficiently adapting vision-language models like CLIP to downstream tasks often suffer from limited performance, excessive parameter counts, or high training costs. This work proposes Feature Projection Learning (FPL), which reformulates classification as a feature projection and reconstruction problem: a lightweight projection module maps class prototypes into the query image feature space, and category scores are derived from the negative mean squared reconstruction error, combined with the original CLIP logits for final prediction. By leveraging this novel formulation, FPL achieves state-of-the-art performance across multiple downstream tasks while maintaining an extremely low number of learnable parameters and minimal training overhead, thereby offering superior adaptation accuracy and efficiency.

📝 Abstract
Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods either suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called \textit{\textbf{F}eature \textbf{P}rojection \textbf{L}earning (FPL)} to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
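The scoring scheme described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it assumes a single linear projection matrix `W` and a scalar fusion weight `alpha`, both of which are illustrative choices; names and shapes are likewise hypothetical.

```python
import numpy as np

def fpl_scores(query_feat, prototypes, W, clip_logits, alpha=1.0):
    """Illustrative FPL-style class scoring (assumed linear projection).

    query_feat:  (D,)   query image feature
    prototypes:  (C, D) one prototype feature per class
    W:           (D, D) learnable projection into the query feature space
    clip_logits: (C,)   logits from the frozen pre-trained CLIP
    """
    # Project each class prototype into the query image feature space.
    projected = prototypes @ W                              # (C, D)
    # Negative mean squared reconstruction error per class:
    # the better a projected prototype reconstructs the query
    # feature, the higher (less negative) the class score.
    proj_scores = -((projected - query_feat) ** 2).mean(axis=1)  # (C,)
    # Fuse with the original CLIP prediction (alpha is a guessed knob).
    return clip_logits + alpha * proj_scores
```

Under this reading, a class whose projected prototype reconstructs the query feature exactly gets the maximal score of zero from the projection branch, so classification reduces to finding the best-reconstructing prototype.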
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Pre-Trained models
CLIP adaptation
downstream tasks
limited supervision
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature Projection Learning
CLIP adaptation
prototype projection
feature reconstruction
vision-language reasoning
Yi Zhang
College of Computer Science and Software Engineering, Shenzhen University, China
Weicheng Lin
College of Computer Science and Software Engineering, Shenzhen University, China
Liang-Jie Zhang
Distinguished Professor@Shenzhen University (SZU), ACM DS & IEEE Fellow, ex-RSM@IBM & ex-CTO@Kingdee
Services Computing, AI, Blockchain & IoT, SOA & Cloud Computing, Digital Transformation