LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition

📅 2024-03-06

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0


🤖 AI Summary

Self-supervised contrastive learning tends to yield coarse-grained global representations, which limits its effectiveness for fine-grained visual recognition. To address this, we propose a purely self-supervised contrastive learning framework that combines global and local representation learning. Our core contribution is Local Discrimination (LoDisc), a pretext task that explicitly steers the model toward discriminative local regions without labels, using a location-wise mask sampling strategy. The global-local framework then further refines the fine-grained feature representations. Experiments show that our method outperforms existing self-supervised approaches on multiple fine-grained recognition benchmarks (including CUB-200-2011, Stanford Cars, and FGVC-Aircraft) while remaining effective on general object recognition tasks (e.g., ImageNet-1K). These results suggest that explicitly modeling local discriminability improves representation quality for fine-grained recognition without sacrificing general recognition.


📝 Abstract

Self-supervised contrastive learning has attracted remarkable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global, coarse-grained image representations that benefit generic object recognition but are insufficient for fine-grained visual recognition. In this paper, we propose to incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a purely self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called Local Discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus on pivotal local regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the Local Discrimination pretext task effectively enhances fine-grained clues in important local regions, and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method leads to a decent improvement across different evaluation settings. Meanwhile, the proposed method is also effective in general object recognition tasks.
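To make the global-local idea concrete, here is a minimal sketch of a two-branch contrastive objective. It assumes an InfoNCE-style loss, which is a common choice in contrastive learning; the abstract does not state which loss the paper uses. The function names, the `temperature` default, and the `weight` balancing term are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss, where queries[i] and keys[i] embed two views of image i.

    Every other key in the batch acts as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity between query i and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching key as the positive.
        total += log_denom - logits[i]
    return total / len(q)


def global_local_loss(global_q, global_k, local_q, local_k, weight=1.0):
    """Hypothetical combination: global term plus a weighted local term.

    The additive form and `weight` are assumptions, not taken from the paper.
    """
    return info_nce(global_q, global_k) + weight * info_nce(local_q, local_k)
```

With matching views, for example `info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])`, the loss is close to zero; swapping the keys so each query is paired with the wrong image makes it much larger, which is the behavior a contrastive objective is meant to have.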

Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained recognition through global-local contrastive learning

Learning discriminative local features for self-supervised fine-grained tasks

Improving visual recognition by focusing on pivotal local regions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines global and local contrastive learning in a single self-supervised framework for fine-grained recognition

Uses a Local Discrimination (LoDisc) pretext task with location-wise mask sampling

Enhances fine-grained features in pivotal regions
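The mask sampling idea in the points above can be illustrated with a small sketch. The abstract does not describe how locations are sampled, so everything here is an assumption: the image is treated as a square grid of patches, one contiguous window is picked uniformly at random, and all patches outside it are masked. The function name `sample_location_mask` and the `grid_size` and `window` parameters are hypothetical.

```python
import random


def sample_location_mask(grid_size=14, window=5, rng=None):
    """Pick one random window of patches to keep and mask everything else.

    Returns (kept, mask): `kept` lists the flat indices of visible patches,
    and mask[r][c] is True where the patch at row r, column c is masked out.
    This is an illustrative assumption, not the paper's sampling rule.
    """
    rng = rng or random.Random()
    # Top-left corner of the visible window, chosen so it fits inside the grid.
    top = rng.randrange(grid_size - window + 1)
    left = rng.randrange(grid_size - window + 1)
    mask = [
        [not (top <= r < top + window and left <= c < left + window)
         for c in range(grid_size)]
        for r in range(grid_size)
    ]
    kept = [
        r * grid_size + c
        for r in range(grid_size)
        for c in range(grid_size)
        if not mask[r][c]
    ]
    return kept, mask
```

Calling `sample_location_mask(rng=random.Random(0))` returns the indices of a 5 x 5 block of visible patches in a 14 x 14 grid; in a pipeline of this kind, only those patches would be passed to the local branch.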
