Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision Transformers (ViTs) generalize poorly in cross-domain few-shot learning (CDFSL) because of substantial inter-domain distribution shifts. This work identifies that image token continuity reinforces large-scale spatial patterns, which exacerbates domain shift and degrades performance on distant target domains; further analysis shows that such continuity is critical in the source domain but has only a marginal effect in the target domain. To address this, the authors propose a lightweight token reordering mechanism: by deliberately arranging non-contiguous pixel-level patches, it perturbs token sequences to steer the model toward transferable small-scale local features, without introducing extra parameters. Integrated with attention-aware analysis and cross-domain feature disentanglement training, the method outperforms state-of-the-art approaches across multiple CDFSL benchmarks and narrows inter-domain distribution gaps. Code and models are publicly available.

📝 Abstract

Vision Transformer (ViT) has achieved remarkable success thanks to large-scale pretraining on general domains, but it still struggles when applied to distant downstream domains with only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by self-attention's insensitivity to token order, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels no longer transition smoothly across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This calls into question the role of image token continuity in ViT's generalization under large domain gaps. In this paper, we investigate this phenomenon and offer an interpretation. We find that continuity helps ViT learn larger spatial patterns, which are harder to transfer than smaller ones and therefore enlarge domain distances. It also implies that only the smaller patterns within each patch can be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Code and models are available at https://github.com/shuaiyi308/ReCIT.
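
To make "disrupting the continuity of image tokens" concrete, here is a minimal NumPy sketch that randomly permutes the non-overlapping patches of an image and measures how much pixel values jump across patch boundaries. It illustrates the phenomenon described in the abstract only; it is not the authors' ReCIT method, and the names `shuffle_patches` and `boundary_jump`, the 32x32 gradient image, and the patch size of 8 are assumptions chosen for illustration.

```python
import numpy as np


def shuffle_patches(image, patch_size, rng):
    """Randomly permute the non-overlapping patches of an (H, W, C) image.

    Returns the shuffled image and the permutation used, where output
    patch i (row-major grid order) is input patch perm[i].
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Split into a (gh * gw, p, p, c) stack of patches in row-major order.
    patches = (
        image.reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch_size, patch_size, c)
    )
    perm = rng.permutation(gh * gw)
    shuffled = patches[perm]
    # Reassemble the permuted patches into an image of the original shape.
    out = (
        shuffled.reshape(gh, gw, patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, c)
    )
    return out, perm


def boundary_jump(image, patch_size):
    """Mean absolute pixel difference across vertical patch boundaries."""
    img = image.astype(float)
    left = img[:, patch_size - 1 :: patch_size][:, :-1]  # last column of each patch
    right = img[:, patch_size::patch_size]  # first column of the next patch
    return float(np.abs(left - right).mean())


# A smooth horizontal gradient: neighbouring pixels differ by a small step.
x = np.linspace(0.0, 1.0, 32)
image = np.tile(x[None, :, None], (32, 1, 3))

rng = np.random.default_rng(0)
shuffled, perm = shuffle_patches(image, patch_size=8, rng=rng)

print("boundary jump before:", boundary_jump(image, 8))
print("boundary jump after: ", boundary_jump(shuffled, 8))
```

The content inside each patch is untouched, so small-scale patterns survive, while the larger cross-patch structure (here, the smooth gradient) is broken up. This mirrors the abstract's distinction between small patterns within a patch and larger spatial patterns that depend on continuity.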

Problem

Research questions and friction points this paper is trying to address.

Explores how image token continuity affects ViT generalization across domains

Investigates how large versus small spatial patterns transfer under domain gaps

Proposes a method that disrupts token continuity to improve few-shot learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disrupts image token continuity in ViT

Encourages reliance on smaller spatial patterns

Reduces domain gaps in few-shot learning

🔎 Similar Papers

TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation