🤖 AI Summary
This paper addresses the performance degradation of frozen CLIP models under few-shot test-time domain adaptation (TTA) caused by domain shift, which is particularly pronounced with lighter backbones such as ViT-B/16. The authors propose an input-space learning framework tailored for CLIP. Methodologically: (1) a learnable side branch attached in parallel with frozen CLIP is enforced, via revert attention, to learn dataset-specific knowledge complementary to CLIP's features; (2) a greedy text ensemble with semantic refinement enhances the inter-dispersion, and thus discriminability, of text features; (3) a generated domain prompt drives progressive, domain-aware vision-language fusion for adaptation to a specific target domain. Evaluated on five large-scale real-world benchmarks from WILDS and DomainNet, the method sets a new state of the art: +5.1 F1 on iWildCam and +3.1% WC accuracy on FMoW.
📝 Abstract
Few-shot Test-Time Domain Adaptation adapts a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, relying solely on feature-space knowledge is constrained by CLIP's prior knowledge. Notably, with a less robust backbone such as ViT-B/16, performance drops significantly on challenging real-world benchmarks. Departing from state-of-the-art methods that inherit CLIP's intrinsic OOD capability, this work introduces learning directly in the input space to complement frozen CLIP with dataset-specific knowledge. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture dataset-specific label semantics for downstream adaptation, we propose enhancing the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method's superiority on five large-scale benchmarks (WILDS and DomainNet), with notable gains over smaller backbones like ViT-B/16: +5.1 in F1 on iWildCam and +3.1% in WC Acc on FMoW.
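To make the revert-attention idea concrete, here is a minimal NumPy sketch of one plausible reading: the side branch pools patch tokens using the *complement* of frozen CLIP's attention weights, so it is pushed toward regions CLIP under-attends and thus learns knowledge exclusive to the branch. The function name, tensor shapes, and the exact complement-and-renormalize rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def revert_attention_pool(tokens, clip_attn):
    """Illustrative 'revert attention' pooling (hypothetical sketch).

    tokens:    (B, N, D) patch tokens from the side branch
    clip_attn: (B, N) frozen CLIP [CLS]-to-patch attention, rows sum to 1
    Returns a (B, D) feature that emphasizes patches CLIP ignores.
    """
    # Complement the frozen attention so high-attention patches are down-weighted.
    reverted = 1.0 - clip_attn
    # Renormalize so the reverted weights again sum to 1 per sample.
    reverted = reverted / reverted.sum(axis=-1, keepdims=True)
    # Pool the side-branch tokens with the reverted weights.
    return (reverted[..., None] * tokens).sum(axis=1)
```

With uniform CLIP attention the reverted weights stay uniform, so pooling reduces to a mean; as CLIP concentrates on a few patches, the branch's pooled feature shifts toward the remaining ones, enforcing the complementary-knowledge constraint described above.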