Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
To address the severe performance degradation of vision-language models (e.g., CLIP) under distribution shift at test time, this paper proposes a fine-tuning-free test-time adaptation method. The approach treats predefined class text embeddings as fixed semantic centroids and formulates label assignment as an Optimal Transport problem to generate high-quality pseudo-labels. It further introduces a multi-template knowledge distillation mechanism that emulates multi-view contrastive learning at no additional computational cost. This is the first work to explicitly model text embeddings as fixed semantic centroids for test-time adaptation. Evaluated on multiple standard benchmarks, the method achieves accuracy gains of up to 7% over recent state-of-the-art approaches while maintaining minimal computational and memory overhead.

📝 Abstract
Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.
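The core idea of the abstract — assigning pseudo-labels to test samples by transporting them onto fixed class text embeddings — can be illustrated with a minimal Sinkhorn-style sketch. This is not the paper's implementation; the function name, `eps`, and `n_iters` are illustrative assumptions, and the column normalization encodes a uniform class prior.

```python
import numpy as np

def sinkhorn_pseudo_labels(image_feats, text_centroids, eps=0.05, n_iters=3):
    """Soft pseudo-label assignment via entropic Optimal Transport (Sinkhorn).

    image_feats:    (N, D) L2-normalized test-image embeddings.
    text_centroids: (K, D) L2-normalized class text embeddings, kept fixed.
    Returns an (N, K) matrix whose rows are class distributions, with the
    column marginals pushed toward a uniform class prior.
    """
    scores = image_feats @ text_centroids.T      # cosine similarities, (N, K)
    Q = np.exp(scores / eps)                     # Gibbs kernel of the OT problem
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True) * Q.shape[1]  # columns: ~uniform class mass
        Q /= Q.sum(axis=1, keepdims=True)               # rows: one distribution per sample
    return Q
```

Because the text centroids never move, only the cheap alternating normalizations run at test time, which is consistent with the paper's emphasis on computational and memory efficiency.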
Problem

Research questions and friction points this paper is trying to address.

Mitigate distribution shifts in vision-language models
Generate pseudo-labels using class text embeddings
Enhance test-time adaptation with minimal computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses class text embeddings for label assignment
Employs Optimal Transport for efficient solution
Integrates multi-template knowledge distillation
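The multi-template distillation idea listed above can be sketched as follows: each prompt template yields its own set of class text embeddings, and treating each template as a distinct "view" lets per-template predictions be distilled toward shared soft targets (e.g., the OT pseudo-labels), mimicking multi-view contrastive learning without extra image encoder passes. This is a hedged illustration, not the paper's code; the function names and the temperature `tau` are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_template_distillation_loss(image_feats, per_template_text, targets, tau=0.01):
    """Cross-entropy between per-template predictions and shared soft targets.

    image_feats:       (N, D) L2-normalized image embeddings.
    per_template_text: (T, K, D) L2-normalized class embeddings, one set per template.
    targets:           (N, K) soft pseudo-labels (e.g., from Optimal Transport).
    Each template plays the role of one 'view' of the classes.
    """
    loss = 0.0
    T = per_template_text.shape[0]
    for t in range(T):
        logits = image_feats @ per_template_text[t].T / tau   # (N, K)
        log_probs = np.log(softmax(logits) + 1e-12)
        loss += -(targets * log_probs).sum(axis=1).mean()     # cross-entropy per view
    return loss / T
```

Since the text embeddings for every template can be precomputed once, the extra "views" cost only matrix products against cached centroids, matching the claim of no added image-side computation.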
Shambhavi Mishra
LIVIA, ETS Montréal, Canada
Julio Silva-Rodríguez
Postdoctoral Researcher, ÉTS Montréal
Computer Vision, Machine Learning, Medical Image Analysis
Ismail Ben Ayed
Professor, ETS Montreal
Computer Vision, Machine Learning, Optimization, Medical Image Analysis
M. Pedersoli
International Laboratory on Learning Systems (ILLS), McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada
J. Dolz
International Laboratory on Learning Systems (ILLS), McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada