Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

📅 2024-06-17
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the challenges of limited labeled data and poor generalization in few-shot recognition (FSR). We propose Stage-Wise retrieval-Augmented fineTuning (SWAT), a framework built upon pretrained vision-language models (VLMs) that integrates downstream few-shot examples with open-world retrieved data in two stages: (1) end-to-end finetuning on the mixed dataset to learn shared representations, and (2) classifier retraining exclusively on the few-shot examples to mitigate the imbalanced distribution and domain gap introduced by the retrieved data. Crucially, SWAT unifies retrieval-augmented learning (RAL) with stage-wise, domain-adaptive finetuning. Evaluated on nine mainstream benchmarks, SWAT improves average accuracy by over 6% relative to state-of-the-art methods, demonstrating significantly enhanced few-shot generalization.

📝 Abstract
Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept concerned by a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves open data, e.g., the VLM's pretraining dataset, to learn models for better serving downstream tasks. RAL has been studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of retrieved data and its domain gaps with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the few-shot data in the second stage. Extensive experiments on nine popular benchmarks demonstrate that SWAT significantly outperforms previous methods by >6% accuracy.
Problem

Research questions and friction points this paper is trying to address.

Addressing few-shot recognition challenges with retrieval-augmented learning
Mitigating data imbalance and domain gaps in VLM finetuning
Proposing stage-wise finetuning to enhance few-shot classification accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented learning with VLMs
Two-stage fine-tuning on mixed data
Classifier retraining on few-shot examples
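The two-stage recipe above can be sketched in miniature. This is a hypothetical toy illustration, not the paper's implementation: a linear "backbone" W standing in for the VLM encoder and a classifier head V, trained by plain gradient descent on synthetic "retrieved" and "few-shot" data. Stage 1 updates both W and V on the mixed set; stage 2 freezes W and retrains only V on the few-shot examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, W, V, lr=0.1, steps=200, update_backbone=True):
    """Cross-entropy gradient descent on a linear backbone + classifier head."""
    n = X.shape[0]
    Y = np.eye(V.shape[1])[y]            # one-hot labels
    for _ in range(steps):
        H = X @ W                        # "features" from the backbone
        P = softmax(H @ V)               # class probabilities
        G = (P - Y) / n                  # gradient of cross-entropy w.r.t. logits
        V -= lr * H.T @ G                # always update the classifier head
        if update_backbone:              # stage 1 only: end-to-end update
            W -= lr * X.T @ (G @ V.T)
    return W, V

# Synthetic stand-ins: a large "retrieved" set and a tiny "few-shot" set
# (2 classes, 4-d inputs; names and sizes are illustrative assumptions).
X_fs = rng.normal(size=(8, 4))
y_fs = np.array([0, 1] * 4)
X_ret = rng.normal(loc=0.5, size=(200, 4))
y_ret = (X_ret[:, 0] > 0.5).astype(int)

W = rng.normal(scale=0.1, size=(4, 3))   # backbone weights
V = rng.normal(scale=0.1, size=(3, 2))   # classifier head

# Stage 1: end-to-end finetuning on mixed (retrieved + few-shot) data.
X_mix = np.vstack([X_ret, X_fs])
y_mix = np.concatenate([y_ret, y_fs])
W, V = train(X_mix, y_mix, W, V, update_backbone=True)

# Stage 2: freeze the backbone, retrain only the classifier on few-shot data.
W_frozen = W.copy()
W, V = train(X_fs, y_fs, W, V, update_backbone=False)
assert np.allclose(W, W_frozen)          # backbone untouched in stage 2
```

In the actual method the backbone is a pretrained VLM and the retrieved set comes from open data such as the VLM's pretraining corpus; the sketch only shows the control flow that separates end-to-end mixed-data finetuning from few-shot classifier retraining.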