A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

πŸ“… 2026-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitation of existing vision-language models, such as CLIP, in street view attribute classification due to their reliance on global representations that fail to capture fine-grained local features. To overcome this, the authors propose CLIP-MHAdapter, a lightweight extension that introduces MLP-based adapters equipped with multi-head self-attention mechanisms over CLIP’s patch tokens to model local interdependencies among image regions. This design effectively enhances local semantic representation while adding only approximately 1.4 million trainable parameters. Evaluated on the Global StreetScapes dataset across eight attribute classification tasks, the method achieves state-of-the-art or competitive performance, striking an effective balance between high accuracy and low computational overhead.

πŸ“ Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
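To make the adapter design concrete, below is a minimal numpy sketch of the core idea described in the abstract: a bottleneck projection over CLIP's patch tokens followed by multi-head self-attention and a residual up-projection. All shapes, weight names, and the ViT-B/16-like dimensions (768-d tokens, 197 patches) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mh_adapter(tokens, W_down, W_up, Wq, Wk, Wv, n_heads=4):
    """Bottleneck adapter with multi-head self-attention over patch
    tokens; shapes and layout are illustrative, not the paper's code."""
    x = tokens @ W_down                       # down-project: (N, D) -> (N, d)
    N, d = x.shape
    hd = d // n_heads                         # per-head dimension
    q = (x @ Wq).reshape(N, n_heads, hd).transpose(1, 0, 2)
    k = (x @ Wk).reshape(N, n_heads, hd).transpose(1, 0, 2)
    v = (x @ Wv).reshape(N, n_heads, hd).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))  # (H, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, d)       # merge heads
    return tokens + out @ W_up                # up-project + residual

rng = np.random.default_rng(0)
D, d, N = 768, 64, 197                        # assumed CLIP ViT-B/16-like sizes
tokens = rng.standard_normal((N, D))
W_down = rng.standard_normal((D, d)) * 0.02
W_up = rng.standard_normal((d, D)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
out = mh_adapter(tokens, W_down, W_up, Wq, Wk, Wv)
print(out.shape)  # (197, 768): same shape as the input patch tokens
```

The bottleneck (D=768 to d=64 here) is what keeps the trainable parameter count small; the self-attention inside the bottleneck is what lets patches exchange information, which is the mechanism the paper credits for capturing localised attributes.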
Problem

Research questions and friction points this paper is trying to address.

street-view image classification
fine-grained attributes
vision-language models
localised features
attribute classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive learning
attention-based adaptation
CLIP
multi-head self-attention
street-view image classification
Qi You
SpaceTimeLab, University College London
Yitai Cheng
SpaceTimeLab, University College London
Zichao Zeng
3DIMPact & SpaceTimeLab, University College London
James Haworth
Associate Professor in Spatio-temporal Analytics, University College London
GIScience · Spatio-temporal · Machine Learning · Transport · Intelligent Transportation Systems