Self-Supervised Learning from Structural Invariance

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a challenge in self-supervised learning: a single input often corresponds to multiple semantically consistent yet structurally diverse targets, such as consecutive video frames, so existing methods model the resulting conditional uncertainty poorly. The authors propose AdaSSL, a general-purpose regularization mechanism that introduces a latent variable to account for this uncertainty and derives a variational lower bound on the mutual information between paired embeddings. By maximizing this bound, AdaSSL captures the multimodal nature of the target distribution while preserving the structural invariances that SSL relies on. The method integrates into both contrastive learning and knowledge distillation frameworks, and experiments across causal representation learning, fine-grained image understanding, and video world modeling show consistent gains, underscoring AdaSSL's generality and effectiveness.

📝 Abstract
Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
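The abstract's key mechanism is a variational lower bound on the mutual information between paired embeddings. A standard instance of such a bound is the InfoNCE estimator, which bounds I(A; B) from below by log(N) minus an in-batch classification loss. The sketch below is illustrative only, assuming an InfoNCE-style estimator in NumPy; the paper's actual AdaSSL regularizer (and its latent-variable formulation) is not specified here, and `infonce_lower_bound` is a hypothetical name:

```python
import numpy as np

def infonce_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE-style variational lower bound on the mutual information
    between paired embeddings z_a[i] <-> z_b[i].

    Uses the bound I(A; B) >= log(N) - L_InfoNCE, where L_InfoNCE is the
    cross-entropy of matching each z_a[i] to its positive z_b[i] against
    the other in-batch pairs acting as negatives.
    """
    # L2-normalize so inner products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    nce_loss = -np.mean(np.diag(log_probs))       # cross-entropy on positives
    n = z_a.shape[0]
    return np.log(n) - nce_loss                   # lower bound on I(A; B)
```

Because the cross-entropy term is non-negative, the estimate is capped at log(N); well-aligned pairs push the bound toward that cap, while unrelated pairs drive it toward zero. In a training objective, the negated bound would serve as the regularization term added to a contrastive or distillation loss.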
Problem

Research questions and friction points this paper is trying to address.

self-supervised learning
one-to-many mapping
conditional uncertainty
structural invariance
representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
one-to-many mapping
latent variable
mutual information
variational lower bound