🤖 AI Summary
This paper addresses the imbalanced generalization of few-shot vision-language models (VLMs) on in-distribution (ID) versus out-of-distribution (OOD) data. It proposes Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning (SRAPF), a framework that combines retrieval augmentation from the VLM's pretraining data, partial finetuning of only the top layers of the visual encoder, and a two-stage training strategy. The key contributions are: (i) identifying that finetuning only the top visual encoder layers achieves the best trade-off between ID and OOD accuracy; and (ii) uncovering a complementary mechanism whereby retrieval augmentation improves OOD robustness while input-level adversarial perturbation enhances ID accuracy. On the ImageNet OOD benchmarks, SRAPF achieves state-of-the-art ID and OOD accuracy, significantly outperforming mainstream few-shot adaptation methods, including prompt tuning and linear probing, while remaining parameter-efficient.
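The "partial finetuning of only the top layers" idea can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the authors' code: `TinyEncoder` is a minimal stand-in for a CLIP ViT visual encoder, and `top_k` is an assumed hyperparameter name.

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a ViT visual encoder: a stack of transformer-like blocks."""
    def __init__(self, dim=8, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def freeze_all_but_top(encoder: TinyEncoder, top_k: int) -> None:
    """Partial finetuning: freeze every parameter, then unfreeze the top-k blocks."""
    for p in encoder.parameters():
        p.requires_grad = False
    for blk in list(encoder.blocks)[-top_k:]:
        for p in blk.parameters():
            p.requires_grad = True

enc = TinyEncoder()
freeze_all_but_top(enc, top_k=2)
trainable = sum(p.numel() for p in enc.parameters() if p.requires_grad)
total = sum(p.numel() for p in enc.parameters())
```

Only the unfrozen top blocks receive gradient updates during adaptation, which is what yields the ID/OOD trade-off the paper reports.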
📝 Abstract
Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with properly chosen hyperparameters significantly outperforms the popular VLM adaptation methods of prompt tuning and linear probing; (2) finetuning the visual encoder alone achieves better efficiency and accuracy than contrastively finetuning both the visual and textual encoders; (3) finetuning only the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered by two simple augmentation techniques: (1) retrieval augmentation, which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation, which promotes robustness during finetuning. Results show that retrieval augmentation boosts OOD accuracy at a slight cost to ID accuracy, whereas adversarial perturbation boosts ID accuracy at a slight cost to OOD accuracy. Yet, perhaps understandably, naively combining the two does not retain the best of both. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partially finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.
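The two-stage recipe described above can be outlined as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `model`, `loss_fn`, and the loader names are illustrative, and an FGSM-style signed-gradient step stands in for whatever input-level adversarial perturbation the paper actually uses.

```python
import torch

def fgsm_perturb(images: torch.Tensor, loss: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Input-level adversarial perturbation: one signed gradient step on the inputs."""
    (grad,) = torch.autograd.grad(loss, images)
    return (images + eps * grad.sign()).detach()

def train_stage(model, loader, loss_fn, opt, adversarial=False, eps=1e-3):
    """One SRAPF stage: plain finetuning (stage 1) or adversarial finetuning (stage 2)."""
    for images, labels in loader:
        images = images.requires_grad_(adversarial)
        loss = loss_fn(model(images), labels)
        if adversarial:
            # Perturb inputs toward higher loss, then train on the perturbed batch.
            images = fgsm_perturb(images, loss, eps)
            loss = loss_fn(model(images), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: partial finetuning on few-shot ID data plus retrieved pretraining data.
#   train_stage(model, id_plus_retrieved_loader, loss_fn, opt, adversarial=False)
# Stage 2: adversarial partial finetuning on the few-shot ID data only.
#   train_stage(model, id_loader, loss_fn, opt, adversarial=True)
```

Splitting the two augmentations across stages is the point: the retrieved data shapes OOD robustness first, and the adversarial step then sharpens ID accuracy without undoing it.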