On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

📅 2026-03-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing theoretical frameworks struggle to accurately characterize the intricate interaction between self-supervised pre-training and downstream fine-tuning, and pre-trained representations are often identifiable only up to inherent symmetries. This work develops an asymptotic analysis framework based on two-stage M-estimation that uses tools from Riemannian geometry to handle such representation symmetries. By leveraging orbit-invariance, the framework links the pre-trained representation to the downstream predictor and precisely characterizes the limiting distribution of the downstream test risk. Applied to canonical settings, including spectral pre-training, factor models, and Gaussian mixture models, the approach yields substantial improvements in problem-specific factors over prior analyses where they apply.
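As a hedged sketch of the symmetry issue described in the summary (the loss name and the group $G$ are illustrative notation, not the paper's): if the pre-training loss is invariant under a group action on the representation parameter,

$$\ell_{\mathrm{pre}}(g \cdot \theta;\, x) \;=\; \ell_{\mathrm{pre}}(\theta;\, x) \quad \text{for all } g \in G,$$

then the unlabeled data identify only the orbit $[\theta] = \{\, g \cdot \theta : g \in G \,\}$ rather than $\theta$ itself. Orbit-invariance of the downstream stage means its risk depends on $\theta$ only through $[\theta]$, which is what allows the asymptotics to be carried out on the quotient space.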
📝 Abstract
Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.
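A minimal two-stage M-estimation sketch consistent with the abstract (the losses, sample sizes, and symbols below are illustrative, not the paper's notation): with $n$ unlabeled samples $x_i$ and $m$ labeled samples $z_j$,

$$\hat\theta_n \in \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{pre}}(\theta;\, x_i), \qquad \hat\beta_m \in \arg\min_{\beta} \frac{1}{m}\sum_{j=1}^{m} \ell_{\mathrm{down}}(\beta;\, \hat\theta_n,\, z_j),$$

and the object of interest is the limiting distribution of the excess downstream test risk $R(\hat\beta_m, \hat\theta_n) - R(\beta^{\star}, \theta^{\star})$ as $n, m \to \infty$, with $\hat\theta_n$ identified only up to the group symmetry discussed above.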
Problem

Research questions and friction points this paper is trying to address.

self-supervised pre-training
two-stage M-estimation
representation symmetry
asymptotic theory
orbit-invariance
Innovation

Methods, ideas, or system contributions that make the work stand out.

two-stage M-estimation
representation symmetry
orbit-invariance
asymptotic theory
self-supervised pre-training
🔎 Similar Papers
No similar papers found.
Mohammad Tinati
Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California
Stephen Tu
University of Southern California