Unsupervised Training of Vision Transformers with Synthetic Negatives

📅 2025-09-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the under-exploitation of hard negative samples in self-supervised visual representation learning and proposes, for the first time, synthesizing hard negatives within Vision Transformers (ViTs). Methodologically, it integrates a controllable synthetic negative generation mechanism into self-supervised frameworks—such as MAE—based on DeiT-S and Swin-T architectures to sharpen discriminative boundaries in contrastive learning. The key contributions are: (1) empirical validation that synthetic hard negatives critically enhance ViT representation quality; (2) an efficient training paradigm requiring no additional annotations or conventional data augmentation; and (3) consistent improvements in feature discriminability—achieving average gains of 1.2–1.8 percentage points over baselines on ImageNet linear evaluation and multiple downstream tasks. This study provides a novel, principled direction and a reproducible technical pathway for optimizing ViTs in self-supervised learning.

Technology Category

Application Category

📝 Abstract
This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.
Problem

Research questions and friction points this paper is trying to address.

Addressing neglected potential of hard negatives
Improving vision transformer representation learning
Enhancing discriminative power with synthetic negatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic hard negatives for vision transformers
Improving representation learning discriminative power
Simple effective technique for DeiT-S Swin-T
🔎 Similar Papers
No similar papers found.