🤖 AI Summary
This work addresses the challenges of task-irrelevant background noise interfering with image prototypes and insufficient cross-modal alignment in CLIP-based few-shot image classification. To this end, the authors propose a training-free refinement approach that first constructs a text-aligned semantic image subspace and projects image prototypes onto this subspace to enhance cross-modal consistency. Subsequently, they integrate textual and visual information into hybrid prototypes and model class-conditional anisotropic distributions via class-specific covariance estimates, which are incorporated into an image-adapted linear discriminant analysis (LDA) classifier. The proposed method achieves significant performance gains over existing techniques across multiple few-shot benchmarks, effectively mitigating the adverse effects of background noise and improving classification accuracy.
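The class-conditional modeling step described above can be illustrated with a small sketch. This is not the authors' implementation: all dimensions, the shrinkage amount, and the synthetic data are assumptions, and with class-specific covariances the scoring rule is strictly quadratic discriminant analysis rather than classical shared-covariance LDA.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical few-shot setup: C classes, K shots, d-dim image features.
C, K, d = 3, 8, 16

# Synthetic support set: per-class clusters with random offsets.
support = rng.normal(size=(C, K, d)) + rng.normal(size=(C, 1, d)) * 2.0
means = support.mean(axis=1)

# Class-specific covariance estimates, shrunk toward the identity
# for stability in the few-shot regime (eps is a hypothetical amount).
eps = 0.5
covs = []
for c in range(C):
    diff = support[c] - means[c]
    cov = diff.T @ diff / (K - 1)
    covs.append((1 - eps) * cov + eps * np.eye(d))

def score(x):
    # Gaussian discriminant score: log N(x | mean_c, cov_c) up to a constant.
    s = np.empty(C)
    for c in range(C):
        delta = x - means[c]
        _, logdet = np.linalg.slogdet(covs[c])
        s[c] = -0.5 * (delta @ np.linalg.solve(covs[c], delta) + logdet)
    return s

# Classify a toy query (a support sample reused for illustration).
pred = int(np.argmax(score(support[1, 0])))
print(pred)
```

Shrinking each covariance toward the identity keeps the estimates well-conditioned when the number of shots K is far smaller than the feature dimension d.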
📝 Abstract
Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work, we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze it from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. To capture only the information in the image space that is relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
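The projection-and-mixing idea from the abstract can be sketched as follows. This is a toy illustration, not the paper's code: the dimensions, the number of principal directions kept, the mixing weight `alpha`, and the synthetic embeddings are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: C classes, K shots, d-dim CLIP embeddings,
# r principal text directions kept (at most C - 1 after centering).
C, K, d, r = 5, 4, 64, 4

# Assumed inputs: unit-normalized text embeddings (one per class)
# and few-shot support image embeddings (K per class).
text_emb = rng.normal(size=(C, d))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
img_emb = rng.normal(size=(C, K, d))
img_emb /= np.linalg.norm(img_emb, axis=2, keepdims=True)

# Image prototypes: per-class means of the support embeddings.
img_proto = img_emb.mean(axis=1)

# Principal directions of the text embedding space: top-r right
# singular vectors of the centered text embeddings.
mu_t = text_emb.mean(axis=0)
_, _, Vt = np.linalg.svd(text_emb - mu_t, full_matrices=False)
P = Vt[:r].T @ Vt[:r]  # projector onto the text-aligned subspace

# Project image prototypes onto the semantic subspace, then mix
# with the text embeddings (alpha is a hypothetical weight).
alpha = 0.5
img_proto_aligned = (img_proto - mu_t) @ P + mu_t
mixed_proto = alpha * text_emb + (1 - alpha) * img_proto_aligned
mixed_proto /= np.linalg.norm(mixed_proto, axis=1, keepdims=True)

# Classify a query by cosine similarity to the mixed prototypes.
query = img_emb[2, 0]  # a support image reused as a toy query
pred = int(np.argmax(mixed_proto @ query))
print(pred)
```

The projection discards image-prototype components outside the span of the text embeddings, which is one way to read the paper's claim that it suppresses instance-specific background information.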