🤖 AI Summary
This work addresses the under-exploitation of hard negative samples in self-supervised visual representation learning and proposes, for the first time, synthesizing hard negatives within Vision Transformers (ViTs). Methodologically, it integrates a controllable synthetic negative generation mechanism into self-supervised frameworks—such as MAE—based on DeiT-S and Swin-T architectures to sharpen discriminative boundaries in contrastive learning. The key contributions are: (1) empirical validation that synthetic hard negatives critically enhance ViT representation quality; (2) an efficient training paradigm requiring no additional annotations or conventional data augmentation; and (3) consistent improvements in feature discriminability—achieving average gains of 1.2–1.8 percentage points over baselines on ImageNet linear evaluation and multiple downstream tasks. This study provides a novel, principled direction and a reproducible technical pathway for optimizing ViTs in self-supervised learning.
📝 Abstract
This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.