🤖 AI Summary
Existing few-shot evaluation protocols for CLIP suffer from transductive bias: most benchmark datasets overlap with CLIP’s pretraining data, making evaluations only partially inductive and inflating estimates of true generalization. Method: This work identifies and mitigates this bias, proposing a purely inductive few-shot evaluation framework grounded in model unlearning—systematically erasing CLIP’s prior knowledge of the test classes and semantically similar concepts to construct an unbiased benchmark. Contribution/Results: Across 5,880 experiments spanning five datasets, twelve random seeds, and three shot levels (1–16), thirteen baselines drop by 55% on average in the inductive setting. An improved few-shot classification method is proposed that consistently achieves state-of-the-art results on the new benchmark, substantially improving the validity and reliability of few-shot evaluation for vision-language models.
📝 Abstract
CLIP is a foundation model with transferable classification performance in the few-shot setting, and several methods have improved its performance using few-shot examples. However, all of these techniques have so far been benchmarked on standard few-shot datasets. We argue that this mode of evaluation does not truly indicate inductive generalization from few-shot examples: since most of these datasets were seen by CLIP during pretraining, the resulting setting is better termed partially transductive. To address this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, existing methods show a significant drop in performance (-55% on average across 13 baselines over multiple datasets). We validate the unlearning technique using oracle baselines. We further propose an improved few-shot classification technique that consistently achieves state-of-the-art performance over 13 recent baseline methods in a comprehensive analysis of 5,880 experiments, varying the datasets, the number of few-shot examples, the unlearning setting, and the random seeds. In summary, we identify a flaw in the evaluation of CLIP-based few-shot classification, address it using unlearning, propose new benchmarks, and provide an improved method.
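The evaluation pipeline the abstract describes — unlearn the test classes (and semantically similar concepts) before any few-shot adaptation, so that performance reflects inductive generalization only — can be sketched in toy form. All names below (`StubCLIP`, `unlearn`, `inductive_eval`) are hypothetical stand-ins, not the paper's actual API or unlearning algorithm:

```python
# Hypothetical sketch of the inductive evaluation protocol; a stub model
# stands in for CLIP, and set removal stands in for unlearning.

class StubCLIP:
    """Toy zero-shot classifier: 'knows' a set of class names from pretraining."""
    def __init__(self, known_classes):
        self.known = set(known_classes)

    def predicts_well(self, cls):
        # Stands in for high zero-shot accuracy on classes seen during pretraining.
        return cls in self.known


def unlearn(model, target_classes, similar_concepts):
    """Erase the model's prior knowledge of test classes and near-synonyms."""
    model.known -= set(target_classes) | set(similar_concepts)
    return model


def inductive_eval(model, test_classes, similar_concepts):
    # Step 1: remove transductive leakage before any few-shot adaptation.
    model = unlearn(model, test_classes, similar_concepts)
    # Step 2: a few-shot method must now rely on the support examples alone.
    return [model.predicts_well(c) for c in test_classes]


model = StubCLIP(known_classes={"dog", "cat", "airplane"})
# Before unlearning, evaluation is partially transductive: "dog" was seen.
assert model.predicts_well("dog")
results = inductive_eval(model, test_classes=["dog"], similar_concepts=["puppy"])
assert results == [False]  # prior knowledge of the test class is gone
```

The point of the sketch is the ordering: unlearning happens before the few-shot method sees any support examples, so residual pretraining knowledge cannot inflate the reported accuracy.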