AI Summary
This work addresses a limitation of existing vision-language models such as CLIP in street-view attribute classification: their reliance on global representations that fail to capture fine-grained local features. To overcome this, the authors propose CLIP-MHAdapter, a lightweight extension that introduces MLP-based adapters equipped with multi-head self-attention over CLIP's patch tokens to model local interdependencies among image regions. This design enhances local semantic representation while adding only approximately 1.4 million trainable parameters. Evaluated on the Global StreetScapes dataset across eight attribute classification tasks, the method achieves state-of-the-art or competitive performance, striking an effective balance between high accuracy and low computational overhead.
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation and fine-tuning methods often rely on CLIP's global image embedding, limiting their ability to capture the fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention over patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
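To make the adapter design concrete, the following is a minimal NumPy sketch of the pattern the abstract describes: multi-head self-attention over patch tokens followed by a bottleneck MLP, wrapped in a residual connection. All dimensions here (token width 512, 4 heads, bottleneck 128) are illustrative assumptions, not the authors' exact configuration, and this sketch is not the released implementation; see the linked repository for the actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MHAdapterSketch:
    """Illustrative adapter: multi-head self-attention over patch tokens,
    then a bottleneck MLP, with a residual connection. Hypothetical
    dimensions; the paper's exact config may differ."""

    def __init__(self, dim=512, heads=4, bottleneck=128, seed=0):
        rng = np.random.default_rng(seed)
        self.heads, self.dh = heads, dim // heads
        s = dim ** -0.5
        # Attention projections (query, key, value, output).
        self.Wq, self.Wk, self.Wv, self.Wo = (
            rng.normal(0.0, s, (dim, dim)) for _ in range(4)
        )
        # Bottleneck MLP: down-project, nonlinearity, up-project.
        self.W_down = rng.normal(0.0, s, (dim, bottleneck))
        self.W_up = rng.normal(0.0, bottleneck ** -0.5, (bottleneck, dim))

    def __call__(self, tokens):
        # tokens: (num_patches, dim), e.g. CLIP patch embeddings.
        n, d = tokens.shape
        split = lambda x: x.reshape(n, self.heads, self.dh).transpose(1, 0, 2)
        q, k, v = split(tokens @ self.Wq), split(tokens @ self.Wk), split(tokens @ self.Wv)
        # Scaled dot-product attention per head: (heads, n, n).
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.dh))
        mixed = (attn @ v).transpose(1, 0, 2).reshape(n, d) @ self.Wo
        hidden = np.maximum(mixed @ self.W_down, 0.0)  # ReLU bottleneck
        return tokens + hidden @ self.W_up             # residual connection

adapter = MHAdapterSketch()
out = adapter(np.zeros((49, 512)))  # 7x7 grid of patch tokens
n_params = sum(w.size for w in (adapter.Wq, adapter.Wk, adapter.Wv,
                                adapter.Wo, adapter.W_down, adapter.W_up))
print(out.shape, n_params)
```

With these assumed dimensions the module has roughly 1.2M weights, the same order as the ~1.4M trainable parameters the paper reports; the attention lets each patch attend to every other patch, which is what distinguishes this design from adapters applied only to the global image embedding.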