Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing few-shot evaluation protocols for CLIP suffer from a severe transductive bias: most benchmark datasets overlap with CLIP's pretraining data, rendering evaluations "pseudo-inductive" and inflating estimates of true generalization. Method: the paper identifies and mitigates this bias by proposing a purely inductive few-shot evaluation framework grounded in unlearning, systematically erasing CLIP's prior knowledge of the test classes and semantically similar concepts to construct an unbiased benchmark. Contribution/Results: 5,880 experiments across five datasets, twelve random seeds, and three shot levels (1–16) reveal a 55% average performance drop across thirteen baselines. The proposed few-shot classification method consistently achieves state-of-the-art results on the new unbiased benchmark, substantially improving the validity and reliability of few-shot evaluation for vision-language models.

📝 Abstract
CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have improved CLIP's performance using few-shot examples. However, all of these techniques have so far been benchmarked on standard few-shot datasets. We argue that this mode of evaluation does not give a true indication of inductive generalization from few-shot examples: since most of these datasets have been seen by the CLIP model during pretraining, the resulting setting can be termed partially transductive. To address this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, the methods show a significant drop in performance (-55% on average across 13 baselines over multiple datasets). We validate the unlearning technique using oracle baselines. We also propose an improved few-shot classification technique that consistently obtains state-of-the-art performance over 13 other recent baseline methods in a comprehensive analysis of 5,880 experiments, varying the datasets, the number of few-shot examples, the unlearning setting, and the random seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and present an improved method.
Problem

Research questions and friction points this paper is trying to address.

Evaluating CLIP's inductive generalization in few-shot learning
Addressing partially transductive bias in few-shot benchmarks
Proposing unlearning-based true inductive baselines for CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning technique for true inductive baselines
Improved few-shot classification method
Comprehensive evaluation with 5,880 experiments
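To make the evaluation setting concrete, the kind of few-shot classification being benchmarked can be sketched as nearest-prototype matching over embeddings. This is a minimal illustrative sketch, not the paper's proposed method: real use would feed images through CLIP's image encoder, whereas here random vectors stand in for embeddings purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize along the last axis, as CLIP-style cosine similarity assumes.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(support, labels, n_classes):
    # One prototype per class: the mean of that class's support embeddings.
    protos = np.stack([support[labels == c].mean(axis=0) for c in range(n_classes)])
    return normalize(protos)

def classify(queries, protos):
    # Assign each query to the prototype with the highest cosine similarity.
    return (normalize(queries) @ protos.T).argmax(axis=1)

# Toy episode: 3 classes, 4 shots each, 64-dim stand-in embeddings.
n_classes, shots, dim = 3, 4, 64
centers = normalize(rng.normal(size=(n_classes, dim)))       # hypothetical class directions
labels = np.repeat(np.arange(n_classes), shots)
support = normalize(centers[labels] + 0.05 * rng.normal(size=(len(labels), dim)))
protos = build_prototypes(support, labels, n_classes)

queries = normalize(centers + 0.05 * rng.normal(size=(n_classes, dim)))
pred = classify(queries, protos)
```

The transductive-bias concern applies regardless of the classifier head: if the encoder has already seen the test classes during pretraining, even this simple prototype scheme benefits from leaked knowledge, which is what the paper's unlearning pipeline removes.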