A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

📅 2024-03-04
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work theoretically investigates the mechanistic differences between global and local feature learning in self-supervised vision transformers (ViTs). To study feature imbalance, we propose a data distribution model with explicit dominant global features and subtler local features. We conduct the first rigorous analysis of gradient descent dynamics for masked autoencoders (MAE) and contrastive learning (CL) on one-layer softmax ViTs, deriving tight reconstruction error bounds. The analysis proves that MAE inherently possesses multi-scale representation capability and achieves near-optimal reconstruction of both global and local features under feature imbalance. In contrast, CL exhibits a strong global bias, even under mild imbalance, and largely neglects local structure. These theoretical predictions align closely with empirical observations, providing the first formal explanation for MAE's robustness and CL's limitations in capturing fine-grained visual patterns.

📝 Abstract
Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical observations reveal that MAE and CL capture different types of representations: CL tends to focus on global patterns, while MAE adeptly captures both global and subtle local information simultaneously. Despite a flurry of recent empirical investigations to shed light on this difference, theoretical understanding remains limited, especially for the dominant architecture, vision transformers (ViTs). In this paper, to provide rigorous insights, we model the visual data distribution by considering two types of spatial features: dominant global features and comparatively minuscule local features, and study the impact of imbalance among these features. We analyze the training dynamics of one-layer softmax-based ViTs on both MAE and CL objectives using gradient descent. Our analysis shows that as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features to achieve near-optimal reconstruction, while CL-trained ViTs favor predominantly global features, even under mild imbalance. These results provide a theoretical explanation for the distinct behaviors of MAE and CL observed in empirical studies.
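To make the two objectives concrete, here is a minimal sketch (not the paper's exact setup) of a one-layer softmax self-attention encoder on patch tokens, with an MAE-style masked-reconstruction loss and an InfoNCE-style contrastive loss computed side by side. All shapes, the masking ratio, the noise augmentation, and the weight matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_encode(X, Wq, Wk, Wv):
    """One softmax self-attention layer; X is (num_patches, dim)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(X.shape[1])
    return softmax(scores, axis=-1) @ (X @ Wv)

def mae_loss(X, mask, Wq, Wk, Wv, Wdec):
    """MAE-style objective: reconstruct masked patches from visible ones."""
    Xv = X.copy()
    Xv[mask] = 0.0                        # zero out masked patches
    H = attention_encode(Xv, Wq, Wk, Wv)  # encode the masked image
    recon = H @ Wdec                      # linear decoder (illustrative)
    return float(np.mean((recon[mask] - X[mask]) ** 2))

def infonce_loss(z1, z2, temp=0.1):
    """CL-style objective: InfoNCE between two pooled views of each image."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temp             # (batch, batch) similarity matrix
    return float(-np.mean(np.log(softmax(logits, axis=1).diagonal())))

d, n = 8, 16                              # patch dim, patches per image
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
Wdec = rng.normal(scale=0.1, size=(d, d))

X = rng.normal(size=(n, d))               # one image's patch tokens
mask = rng.random(n) < 0.75               # MAE-style high masking ratio
print("MAE loss:", mae_loss(X, mask, Wq, Wk, Wv, Wdec))

batch = rng.normal(size=(4, n, d))        # batch of 4 images
def pooled_view(img):                     # augment, encode, mean-pool
    aug = img + rng.normal(scale=0.05, size=img.shape)
    return attention_encode(aug, Wq, Wk, Wv).mean(axis=0)

z1 = np.stack([pooled_view(img) for img in batch])
z2 = np.stack([pooled_view(img) for img in batch])
print("InfoNCE loss:", infonce_loss(z1, z2))
```

The structural difference the paper analyzes is visible here: the MAE loss is scored per masked patch, so every local feature contributes to the gradient, while the contrastive loss only sees mean-pooled representations, where dominant global features can swamp small local ones.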
Problem

Research questions and friction points this paper is trying to address.

Theoretical understanding of self-supervised learning
Comparison of MAE and CL in vision transformers
Impact of feature imbalance on learning dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data distribution model with explicit global and local spatial features
Gradient descent analysis of one-layer softmax ViTs under MAE and CL objectives
Reconstruction error bounds characterizing the impact of feature imbalance
Authors

Yu Huang
Department of Statistics and Data Science, Wharton School, University of Pennsylvania

Zixin Wen
Carnegie Mellon University
Machine Learning Theory

Yuejie Chi
Yale University
data science · generative AI · reinforcement learning · signal processing

Yingbin Liang
Department of Electrical and Computer Engineering, The Ohio State University