Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

CLIP's performance degrades significantly in cross-domain few-shot learning because of domain shift and scarce labeled data. This work identifies, for the first time, that enforcing alignment between low-semantic “tail image patches” and their text embeddings harms generalization. To address this, the authors propose an adaptive head-tail alignment mechanism: based on patch-text similarity, it weakens alignment for tail regions and strengthens it for high-semantic regions. Avoiding uniform alignment across all image regions mitigates overfitting in few-shot scenarios. The approach departs from the conventional uniform alignment paradigm and achieves state-of-the-art performance on four cross-domain few-shot benchmarks.
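The patch-text similarity analysis mentioned above can be illustrated with a small sketch. It ranks patch embeddings by cosine similarity to a class text embedding and treats the lowest-ranked share as tail patches. The function names, the toy 2-D embeddings, and the `tail_fraction` threshold are all assumptions made for illustration; the summary does not state how the paper actually selects tail patches.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_head_tail(patch_embs, text_emb, tail_fraction=0.5):
    """Split patch indices into head (high similarity) and tail (low similarity).

    tail_fraction is a hypothetical hyperparameter, not a value from the paper.
    """
    sims = [cosine(p, text_emb) for p in patch_embs]
    # Sort patch indices from least to most similar to the text embedding.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    n_tail = int(len(sims) * tail_fraction)
    tail = sorted(order[:n_tail])
    head = sorted(order[n_tail:])
    return sims, head, tail


# Toy 2-D embeddings standing in for CLIP patch and text features (assumed data).
patches = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
sims, head, tail = split_head_tail(patches, text, tail_fraction=0.5)
print("head patches:", head)
print("tail patches:", tail)
```

In this toy setup the two patches pointing away from the text direction land in the tail group, which is the group an adaptive scheme would stop pulling toward the text embedding.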

📝 Abstract

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance degrades significantly in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on target-domain few-shot fine-tuning in the CLIP-based CDFSL task. Prevailing fine-tuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing certain low-similarity image tokens, termed "tail tokens", away from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under large domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs. The common belief in alignment therefore holds only for tokens that already contain sufficient semantic information; for tail tokens, forcing alignment leads to excessive overfitting to the scarce training data, whereas breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm into an adaptive one, with both alignment strengthening and alignment weakening. Extensive experiments on four challenging CDFSL benchmarks show that ATHA achieves state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.
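To make the contrast between uniform and adaptive alignment concrete, here is a minimal sketch of two toy objectives over per-token similarities. It is not the ATHA loss: the signed-weight form, the function names, and the `head_weight` and `tail_weight` values are assumptions chosen only to show how one objective can pull head tokens toward the text embedding while pushing tail tokens away.

```python
def uniform_alignment_loss(sims):
    """Toy uniform objective: minimizing it raises every token's similarity."""
    return -sum(sims) / len(sims)


def adaptive_alignment_loss(sims, tail_idx, head_weight=1.0, tail_weight=0.5):
    """Toy adaptive objective (hypothetical form, not taken from the paper).

    Head tokens contribute -head_weight * sim, so minimizing pulls them
    toward the text embedding. Tail tokens contribute +tail_weight * sim,
    so minimizing pushes them away from it.
    """
    tail = set(tail_idx)
    total = 0.0
    for i, s in enumerate(sims):
        if i in tail:
            total += tail_weight * s
        else:
            total -= head_weight * s
    return total / len(sims)


# Assumed token-text similarities; the last two tokens play the role of tail tokens.
example_sims = [0.9, 0.7, 0.1, -0.2]
example_tail = [2, 3]
uniform = uniform_alignment_loss(example_sims)
adaptive = adaptive_alignment_loss(example_sims, example_tail)
print("uniform loss:", uniform)
print("adaptive loss:", adaptive)
```

The sign flip on tail tokens is the point of the sketch: the gradient of the adaptive objective with respect to a tail token's similarity is positive, so gradient descent lowers that similarity instead of raising it.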

Problem

Research questions and friction points this paper is trying to address.

Cross-Domain Few-Shot Learning

CLIP

Vision-Language Models

Domain Shift

Few-Shot Fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Alignment

Tail Tokens

Source-Free CDFSL