Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
To address the severe performance degradation of vision-language models (e.g., CLIP) under distribution shift at test time, this paper proposes a fine-tuning-free test-time adaptation method. The approach treats predefined class text embeddings as fixed semantic centroids and formulates label assignment as an Optimal Transport problem to generate high-quality pseudo-labels. It further introduces a multi-template knowledge distillation mechanism that emulates multi-view contrastive learning at no additional computational cost. This is the first work to explicitly model text embeddings as fixed semantic centroids for test-time adaptation. Evaluated on multiple standard benchmarks, the method achieves accuracy gains of up to 7% over recent state-of-the-art approaches while maintaining minimal computational and memory overhead.

📝 Abstract
Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.
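The core idea of the abstract — assigning pseudo-labels to test samples by transporting them onto fixed class text embeddings — can be illustrated with a minimal Sinkhorn-style sketch. This is not the paper's implementation; the function name, `eps`, and `n_iters` are illustrative assumptions, and the column normalization encodes a uniform class prior.

```python
import numpy as np

def sinkhorn_pseudo_labels(image_feats, text_centroids, eps=0.05, n_iters=3):
    """Soft pseudo-label assignment via entropic Optimal Transport (Sinkhorn).

    image_feats:    (N, D) L2-normalized test-image embeddings.
    text_centroids: (K, D) L2-normalized class text embeddings, kept fixed.
    Returns an (N, K) matrix whose rows are class distributions, with the
    column marginals pushed toward a uniform class prior.
    """
    scores = image_feats @ text_centroids.T      # cosine similarities, (N, K)
    Q = np.exp(scores / eps)                     # Gibbs kernel of the OT problem
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True) * Q.shape[1]  # columns: ~uniform class mass
        Q /= Q.sum(axis=1, keepdims=True)               # rows: one distribution per sample
    return Q
```

Because the text centroids never move, only the cheap alternating normalizations run at test time, which is consistent with the paper's emphasis on computational and memory efficiency.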
Problem

Research questions and friction points this paper is trying to address.

Mitigate distribution shifts in vision-language models
Generate pseudo-labels using class text embeddings
Enhance test-time adaptation with minimal computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses class text embeddings for label assignment
Employs Optimal Transport for efficient solution
Integrates multi-template knowledge distillation
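The multi-template distillation idea listed above can be sketched as follows: each prompt template yields its own set of class text embeddings, and treating each template as a distinct "view" lets per-template predictions be distilled toward shared soft targets (e.g., the OT pseudo-labels), mimicking multi-view contrastive learning without extra image encoder passes. This is a hedged illustration, not the paper's code; the function names and the temperature `tau` are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_template_distillation_loss(image_feats, per_template_text, targets, tau=0.01):
    """Cross-entropy between per-template predictions and shared soft targets.

    image_feats:       (N, D) L2-normalized image embeddings.
    per_template_text: (T, K, D) L2-normalized class embeddings, one set per template.
    targets:           (N, K) soft pseudo-labels (e.g., from Optimal Transport).
    Each template plays the role of one 'view' of the classes.
    """
    loss = 0.0
    T = per_template_text.shape[0]
    for t in range(T):
        logits = image_feats @ per_template_text[t].T / tau   # (N, K)
        log_probs = np.log(softmax(logits) + 1e-12)
        loss += -(targets * log_probs).sum(axis=1).mean()     # cross-entropy per view
    return loss / T
```

Since the text embeddings for every template can be precomputed once, the extra "views" cost only matrix products against cached centroids, matching the claim of no added image-side computation.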
Shambhavi Mishra
LIVIA, ETS Montréal, Canada
Julio Silva-Rodríguez
Postdoctoral Researcher, ÉTS Montréal
Computer Vision, Machine Learning, Medical Image Analysis
Ismail Ben Ayed
Professor, ETS Montreal
Computer Vision, Machine Learning, Optimization, Medical Image Analysis
M. Pedersoli
International Laboratory on Learning Systems (ILLS), McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada
J. Dolz
International Laboratory on Learning Systems (ILLS), McGILL - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada