AI Summary
This work addresses a critical yet previously overlooked issue in unsupervised cross-domain few-shot learning (CDFSL): target-domain fine-tuning exacerbates the attention sink phenomenon in vision-language models such as CLIP, thereby degrading class discriminability. The study systematically identifies and analyzes this problem for the first time and introduces a dynamic token reweighting mechanism. During fine-tuning, the method adaptively adjusts token weights based on their relevance to target-domain categories, suppressing over-reliance on easily learned tokens and enhancing the model's focus on informative, hard-to-learn tokens. This approach effectively mitigates attention collapse and achieves state-of-the-art performance across four benchmark datasets, significantly improving both generalization and discriminative capability in cross-domain few-shot scenarios.
Abstract
Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL), where the model must transfer source-domain information to target domains with scarce training data, remains underexplored. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior work: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. Through extensive experiments, we interpret this phenomenon as the model's shortcut learning for domain adaptation: to overcome the large gap between the source and target domains, the model tends to push tokens that are initially closer to target-domain classes (i.e., simple tokens) even closer to these classes, which exacerbates the attention sink and wastes the capacity to learn other tokens that are discriminative but initially farther away (i.e., hard tokens). To address this, we propose a novel approach that dynamically re-weights tokens according to their relevance to target-domain classes during target-domain fine-tuning. This explicitly suppresses the model's reliance on simple tokens and enhances the learning of hard tokens, reducing sink tokens and improving discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method and demonstrate new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.
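The re-weighting idea described above can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: relevance is taken as a token's highest cosine similarity to any class embedding, weights come from a softmax over negative relevance, and the names `token_relevance`, `reweight`, `pooled_feature`, and `temperature` are hypothetical. It is not the authors' implementation, which is in the linked repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def token_relevance(tokens, class_embeds):
    """Assumed relevance score: highest similarity of each token to any class."""
    return [max(cosine(t, c) for c in class_embeds) for t in tokens]

def reweight(relevance, temperature=0.1):
    """Softmax over negative relevance (an assumption, not the paper's formula).

    Tokens already close to a class ("simple" tokens) receive smaller weights;
    tokens farther away ("hard" tokens) receive larger ones.
    """
    logits = [-r / temperature for r in relevance]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_feature(tokens, weights):
    """Weighted sum of token embeddings using the computed weights."""
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]

# Toy data: two class embeddings and three token embeddings in 2-D.
class_embeds = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.9, 0.1], [0.5, 0.5], [-0.2, 0.3]]

relevance = token_relevance(tokens, class_embeds)
weights = reweight(relevance)
feature = pooled_feature(tokens, weights)
```

In this toy case the first token is nearly aligned with a class embedding, so it receives the smallest weight, while the second token, which is least aligned with either class, receives the largest. In a fine-tuning loop the weights would be recomputed at every step from the current embeddings, which is what would make the re-weighting dynamic.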