🤖 AI Summary
This work addresses a performance gap in cross-domain few-shot learning under source-domain unavailability: prompt-based fine-tuning methods for vision-language models (e.g., CLIP) consistently underperform adapter-based approaches. The study finds that adapters such as LoRA improve modality alignment and class separability by mitigating attention collapse in the visual [CLS] token. Building on this insight, the authors propose Semantic Probe, a plug-and-play attention rectification framework that improves both prompt-based (e.g., MaPLe) and adapter-based methods. On four cross-domain few-shot benchmarks, Semantic Probe achieves state-of-the-art results and benefits both fine-tuning paradigms.
📝 Abstract
Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models such as CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make these prompt-based methods, which are effective in-domain, competitive again in CDFSL, we analyze this phenomenon and find that LoRA's advantage stems from rectifying the collapsed attention of the visual CLS token: by focusing on text-related visual regions, it enhances modality alignment and class separation. Further, we find that the textual EOS token attends to visual samples much better, and that CLIP's standard contrastive loss only weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale: Semantic Probe achieves state-of-the-art performance and benefits both fine-tuning paradigms. Code will be released.
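For readers unfamiliar with the objective mentioned above, the following is a minimal sketch of a CLIP-style contrastive classification loss as it is commonly used in few-shot fine-tuning: cosine similarities between one image feature and per-class text features are scaled by a temperature and passed to cross-entropy. It is not the authors' code and does not implement Semantic Probe; the function names, the toy feature vectors, and the temperature value are illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_logits(image_feat, text_feats, temperature=0.01):
    """Temperature-scaled cosine similarity between one image and each class text.

    image_feat: pooled image feature (hypothetical toy vector here).
    text_feats: one pooled text feature per class.
    temperature: assumed value for illustration; CLIP learns this scale.
    """
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    return logits

def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Toy example: a 2-D image feature and two class text features.
image = [1.0, 0.0]
class_texts = [[1.0, 0.0], [0.0, 1.0]]
loss = cross_entropy(clip_logits(image, class_texts, temperature=1.0), label=0)
```

In this sketch the loss sees only the final pooled features, so it places no direct constraint on which image regions the visual CLS token attends to. That may be one way to read the remark that the standard loss only weakly constrains modality alignment, although the abstract does not spell out the argument.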