🤖 AI Summary
Self-supervised contrastive learning often yields coarse-grained global representations, limiting its effectiveness for fine-grained visual recognition. To address this, we propose a purely self-supervised contrastive learning framework that integrates global and local representation learning. Our core contribution is Local Discrimination (LoDisc), a novel pretext task that explicitly models discriminative local regions without labels, enabled by a position-aware mask sampling strategy. We further design a global-local feature disentanglement and fusion mechanism, coupled with a fine-grained contrastive loss. Extensive experiments demonstrate that our method significantly outperforms existing self-supervised approaches on multiple fine-grained recognition benchmarks (including CUB-200-2011, Stanford Cars, and FGVC-Aircraft) while maintaining competitive performance on general object recognition tasks (e.g., ImageNet-1K). These results indicate that explicitly modeling local discriminability improves representation quality on both fine-grained and general recognition.
📝 Abstract
Self-supervised contrastive learning has attracted considerable attention for its strength in representation learning. However, current contrastive methods tend to learn global, coarse-grained image representations that benefit generic object recognition but are insufficient for fine-grained visual recognition. In this paper, we propose incorporating subtle local fine-grained feature learning into global self-supervised contrastive learning through a purely self-supervised global-local fine-grained contrastive learning framework. Specifically, we introduce a novel pretext task called Local Discrimination (LoDisc) that explicitly guides the self-supervised model's focus toward pivotal local regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the Local Discrimination pretext task effectively enhances fine-grained cues in important local regions, and that the global-local framework further refines the fine-grained feature representations of images. Extensive experiments on different fine-grained object recognition tasks demonstrate that the proposed method improves performance across different evaluation settings. The proposed method is also effective on general object recognition tasks.
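
To make the two ingredients named above more concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify how pivotal regions are scored or which contrastive loss is used, so two assumptions are made here: each image patch already carries an importance score (for example, from an attention map), and the contrastive objective is a standard InfoNCE loss. The function names `topk_patch_mask` and `info_nce`, and the parameters `keep_ratio` and `temperature`, are hypothetical and chosen only for illustration.

```python
import math

def topk_patch_mask(scores, keep_ratio=0.25):
    """Keep the highest-scoring patches; True marks a kept (pivotal) patch.

    ASSUMPTION: `scores` holds one importance value per patch position.
    How the paper obtains such scores is not stated in the abstract.
    """
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(scores))]

def _normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce(view_a, view_b, temperature=0.2):
    """Generic InfoNCE loss: row i of view_a is the positive for row i of view_b.

    ASSUMPTION: a standard InfoNCE objective stands in for the paper's loss.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of sample i against every sample in the other view.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

if __name__ == "__main__":
    # Four patches; keep the two with the highest (hypothetical) scores.
    print(topk_patch_mask([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5))
    # Matching embeddings give a lower loss than mismatched ones.
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(info_nce(eye, eye), info_nce(eye, eye[1:] + eye[:1]))
```

In a global-local setup of the kind described, one would presumably apply such a loss once to embeddings of the full views and once to embeddings of the masked views that retain only the selected patches, then combine the two terms; the weighting between them is not given in the abstract.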