Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

📅 2023-10-26
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
This work addresses both supervised and label-free person and vehicle re-identification (Re-ID) by adapting the large-scale vision-language model CLIP. Contrary to prevailing prompt-learning paradigms—which the authors empirically identify as suboptimal for CLIP-based Re-ID—they propose a prompt-free, end-to-end fine-tuning framework termed prototypical contrastive learning (PCL), which directly optimizes CLIP’s image encoder. PCL derives class prototypes from image features and enforces contrastive alignment between each sample and its prototype. The authors further extend PCL to fully unsupervised settings: pseudo-labels are generated via unsupervised clustering, and feature alignment is jointly optimized with clustering. Evaluated on multiple person and vehicle Re-ID benchmarks, the method consistently matches or outperforms existing CLIP-based Re-ID approaches. In the unsupervised setting, it achieves state-of-the-art performance, demonstrating both the efficacy and generalizability of direct image-encoder fine-tuning for Re-ID.
📝 Abstract
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Adapting CLIP models for object re-identification tasks
Eliminating prompt learning necessity through direct fine-tuning
Extending prototypical contrastive learning to unsupervised scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning CLIP image encoder with contrastive loss
Using prototypical learning to replace prompt learning
Applying method to both supervised and unsupervised scenarios
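The core idea above — pulling each image feature toward the prototype of its (pseudo-)class while pushing it away from all other prototypes — can be illustrated with a minimal NumPy sketch of a prototypical contrastive loss. This is an assumption-laden simplification, not the paper's implementation: the exact formulation, temperature, prototype update rule (e.g. momentum), and clustering step in the paper may differ.

```python
import numpy as np

def pcl_loss(features, labels, temperature=0.07):
    """Sketch of a prototypical contrastive loss.

    features: (N, D) image embeddings (e.g. from CLIP's image encoder)
    labels:   (N,) integer class or pseudo-class labels (e.g. from clustering)
    """
    # L2-normalize embeddings so dot products are cosine similarities
    features = features / np.linalg.norm(features, axis=1, keepdims=True)

    # Prototype = normalized mean feature of each (pseudo-)class
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)

    # Temperature-scaled similarity of every sample to every prototype
    logits = features @ protos.T / temperature

    # Numerically stable cross-entropy toward each sample's own prototype
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target_idx = np.searchsorted(classes, labels)
    return -log_prob[np.arange(len(labels)), target_idx].mean()
```

Well-separated clusters yield a small loss, while features scattered near other classes' prototypes are penalized; fine-tuning the encoder against this loss tightens each identity's cluster, which is the behavior the summary attributes to PCL.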
Jiachen Li
College of Information Science and Electronic Engineering, Zhejiang University, China
Xiaojin Gong
Zhejiang University
Computer Vision · Image Processing · Artificial Intelligence