🤖 AI Summary
Vehicle re-identification (Re-ID) methods that rely on manually annotated semantic attributes generalize poorly and are costly to scale. To address this, we propose CLIP-SENet, an end-to-end semantic enhancement framework that leverages the CLIP image encoder and requires no auxiliary text or attribute annotations. It autonomously extracts general vehicle semantics and refines them at a fine-grained level with an Adaptive Fine-grained Enhancement Module (AFEM), then fuses the refined semantic features with standard Re-ID appearance features to sharpen inter-vehicle distinctions. Extensive experiments demonstrate new state-of-the-art performance on three major benchmarks: 92.9% mAP / 98.7% Rank-1 on VeRi-776; 90.4% Rank-1 / 98.7% Rank-5 on VehicleID; and 89.1% mAP / 97.9% Rank-1 on the more challenging VeRi-Wild.
📝 Abstract
Vehicle re-identification (Re-ID) is a crucial task in intelligent transportation systems (ITS), aiming to retrieve and match the same vehicle across different surveillance cameras. Numerous studies have explored semantic enhancement as a way to improve vehicle Re-ID. However, these methods often rely on additional annotated information to enable models to extract effective semantic features, which limits their scalability and generalization. In this work, we propose a CLIP-based Semantic Enhancement Network (CLIP-SENet), an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes, facilitating the generation of more robust semantic feature representations. Inspired by the zero-shot capabilities that large-scale vision-language models bring to downstream tasks, we leverage the powerful cross-modal descriptive capacity of the CLIP image encoder to initially extract general semantic information. Instead of using a text encoder for semantic alignment, we design an adaptive fine-grained enhancement module (AFEM) that adaptively enhances this general semantic information at a fine-grained level, yielding robust semantic feature representations. These features are then fused with common Re-ID appearance features to further refine the distinctions between vehicles. Our comprehensive evaluation on three benchmark datasets demonstrates the effectiveness of CLIP-SENet. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
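The pipeline described above can be sketched at a high level: CLIP image features stand in for the general semantics, a small gating network plays the role of the AFEM's fine-grained re-weighting, and the result is fused with appearance features for retrieval by cosine similarity. This is a minimal NumPy sketch under assumed dimensions and randomly initialized weights; the function names, layer sizes, and the gating form of the AFEM are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize rows to unit length so dot products become cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def afem_sketch(semantic, w1, w2):
    # Hypothetical stand-in for the AFEM: a tiny gating MLP produces
    # per-channel weights in (0, 1) that re-scale the semantic features,
    # emphasizing discriminative channels and suppressing generic ones.
    hidden = np.maximum(semantic @ w1, 0.0)        # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gate
    return semantic * gate

# Toy batch: 4 vehicle images. Assumed dims: 512-d CLIP image-encoder
# semantics, 2048-d appearance features from a standard Re-ID backbone.
sem = rng.normal(size=(4, 512)).astype(np.float32)
app = rng.normal(size=(4, 2048)).astype(np.float32)
w1 = rng.normal(scale=0.05, size=(512, 128)).astype(np.float32)
w2 = rng.normal(scale=0.05, size=(128, 512)).astype(np.float32)

refined = afem_sketch(sem, w1, w2)
# Fuse refined semantics with appearance features (concatenation here;
# the paper's actual fusion may differ).
fused = np.concatenate([l2_normalize(refined), l2_normalize(app)], axis=1)

# Retrieval: pairwise cosine similarity over the fused embeddings.
sims = l2_normalize(fused) @ l2_normalize(fused).T
print(fused.shape, sims.shape)
```

In a trained system the gating weights would be learned end-to-end with the Re-ID objective; here they merely illustrate how channel-wise re-weighting can refine frozen CLIP semantics before fusion.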