Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Self-supervised vision transformer training relies heavily on large-scale real-world data and faces challenges in manually curating hard negative samples. Method: This paper proposes Syn2Co, the first framework to jointly leverage generative models to synthesize image data and to construct synthetic hard negatives directly in the representation space, establishing a more challenging contrastive learning environment and enabling end-to-end self-supervised training of DeiT-S and Swin-T without real labels or explicit hard-negative mining. Contribution/Results: Syn2Co significantly improves feature robustness and cross-task transferability, with ImageNet linear-evaluation performance that closely approaches fully supervised training on real data. The work delineates the effective boundary of synthetic data in representation learning and establishes a novel paradigm for reducing dependence on real-world supervision.

📝 Abstract
This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage "fake it till you make it". While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of "faking it" in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework - dubbed Syn2Co - combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.
Problem

Research questions and friction points this paper is trying to address.

Exploring synthetic data for self-supervised vision representation learning
Generating synthetic hard negatives in representation space
Evaluating synthetic training robustness on vision transformer architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data augmentation for vision transformers
Generating synthetic hard negatives in representation space
Syn2Co framework combining both synthetic approaches
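The page does not detail how the synthetic hard negatives are constructed. One common realization of this idea, sketched below under assumed conventions (unit-normalized embeddings, hypothetical function and parameter names), is to rank existing negatives by similarity to the anchor and convexly mix the hardest ones in representation space, yielding new negatives that did not come from any real image:

```python
import numpy as np

def synthesize_hard_negatives(anchor, negatives, k=8, num_synthetic=4, seed=0):
    """Sketch of representation-space hard-negative synthesis.

    anchor:    (d,) unit-norm embedding of the query view.
    negatives: (n, d) unit-norm embeddings of existing negatives.
    Returns (num_synthetic, d) unit-norm synthetic negatives made by
    convexly mixing pairs of the k negatives most similar to the anchor.
    This is an illustrative construction, not the paper's exact method.
    """
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    sims = negatives @ anchor
    hardest = negatives[np.argsort(-sims)[:k]]

    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(num_synthetic):
        i, j = rng.choice(len(hardest), size=2, replace=False)
        lam = rng.uniform()
        mix = lam * hardest[i] + (1.0 - lam) * hardest[j]
        # Re-normalize so the mixture lives on the same unit hypersphere.
        synthetic.append(mix / np.linalg.norm(mix))
    return np.stack(synthetic)
```

In a contrastive loss such as InfoNCE, these synthetic vectors would simply be appended to the real negative set for the anchor, making the contrast harder without any extra image generation or mining.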