🤖 AI Summary
Problem: The lack of principled criteria for choosing between contrastive joint-embedding and generative reconstruction paradigms in self-supervised learning (SSL) hinders theoretical understanding and practical design.
Method: We conduct a rigorous theoretical analysis under linear model assumptions, deriving closed-form solutions for both paradigms and explicitly modeling the view-generation process to characterize the impact of data augmentations and nuisance features on representation learning.
Results: Our analysis reveals that joint embedding achieves asymptotically optimal performance under strong nuisance features with weaker alignment requirements, whereas reconstruction is inherently sensitive to such features. We further derive the minimal necessary condition linking augmentation strength and feature alignment. This work provides the first quantitative, interpretable explanation for the empirical superiority of joint embedding over reconstruction on complex real-world data, yielding theoretically grounded guidance for SSL paradigm selection.
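The contrast between the two paradigms under strong nuisance features can be illustrated with a minimal numerical sketch. The setup below is our own toy instantiation of the linear-model analysis, not the paper's exact model: data consist of low-variance signal coordinates plus high-magnitude nuisance coordinates, augmentations resample only the nuisance, a rank-k linear autoencoder stands in for reconstruction, and a spectral cross-view eigendecomposition stands in for joint embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_sig, d_nui, k = 20000, 5, 5, 5

# Toy linear setup (illustrative assumption): relevant signal features
# of unit variance plus high-magnitude nuisance features.
z = rng.normal(size=(n, d_sig))                      # signal, variance 1
s = 5.0 * rng.normal(size=(n, d_nui))                # nuisance, variance 25
X = np.concatenate([z, s], axis=1)

# View generation: the augmentation resamples only the nuisance
# coordinates, so two views share the signal but differ in the nuisance.
def make_view(X):
    V = X.copy()
    V[:, d_sig:] = 5.0 * rng.normal(size=(n, d_nui))
    return V

V1, V2 = make_view(X), make_view(X)

# Reconstruction paradigm (rank-k linear autoencoder on the original
# sample): the closed form is the top-k principal directions of X, which
# chase variance and therefore latch onto the strong nuisance coordinates.
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
W_rec = Vt[:k].T                                     # d x k encoder

# Joint-embedding paradigm (spectral sketch): top-k eigenvectors of the
# symmetrized cross-view covariance; the view-independent nuisance
# averages out, leaving the shared signal directions.
C = (V1.T @ V2 + V2.T @ V1) / (2 * n)
_, eigvecs = np.linalg.eigh(C)
W_je = eigvecs[:, -k:]                               # d x k encoder

def signal_mass(W):
    """Fraction of the encoder's weight on the signal coordinates."""
    return np.linalg.norm(W[:d_sig]) ** 2 / np.linalg.norm(W) ** 2

print(f"reconstruction: {signal_mass(W_rec):.2f}")   # near 0: nuisance-dominated
print(f"joint embedding: {signal_mass(W_je):.2f}")   # near 1: signal-dominated
```

With strong nuisance features, the reconstruction encoder spends its rank budget on the high-variance nuisance, while the joint-embedding encoder recovers the signal shared across views, mirroring the paper's qualitative conclusion.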
📝 Abstract
Reconstruction and joint embedding have emerged as two leading paradigms in Self-Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space, whereas joint-embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed-form solutions for both approaches, we precisely characterize how the view-generation process, e.g., data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint-embedding methods are preferable because they impose a strictly weaker alignment condition than reconstruction-based methods. These results not only clarify the trade-offs between the two paradigms but also substantiate the empirical success of joint-embedding approaches on challenging real-world datasets.