LayerLock: Non-collapsing Representation Learning with Progressive Freezing

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In self-supervised video representation learning, pixel-level reconstruction often leads to representation collapse, while direct latent-feature prediction suffers from training instability. To address this, we propose LayerLock, a progressive layer-freezing strategy that exploits the differing convergence rates across ViT layers: shallower layers converge faster than deeper ones. Within a video masked-autoencoder framework, LayerLock gradually freezes already-converged layers and shifts the prediction target from raw pixels to the latent outputs of those frozen layers, enabling a stable transition from pixel reconstruction to latent prediction and effectively mitigating collapse. Experiments demonstrate that LayerLock significantly outperforms conventional pixel-prediction baselines on 4D spatiotemporal perception tasks. Moreover, it scales to billion-parameter models (up to 4B parameters), achieving both high training efficiency and superior representation quality.

📝 Abstract
We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
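The abstract describes an explicit schedule that freezes shallow layers first and moves the prediction target to the deepest frozen layer. The paper's exact schedule is not given here, so the following is a minimal sketch under assumed details: the function names, the linear ramp, and the `freeze_fraction` parameter are all hypothetical.

```python
def frozen_layer_count(step: int, total_steps: int, n_layers: int,
                       freeze_fraction: float = 0.5) -> int:
    """Number of shallow ViT layers frozen at `step`.

    Hypothetical linear ramp: layers freeze in depth order until, after
    `freeze_fraction` of training, all but the deepest layer are frozen.
    """
    progress = min(1.0, step / (freeze_fraction * total_steps))
    return int(progress * (n_layers - 1))  # the final layer stays trainable


def prediction_target(step: int, total_steps: int, n_layers: int) -> str:
    """Target implied by the schedule: raw pixels while nothing is frozen,
    then the latent output of the deepest frozen layer."""
    k = frozen_layer_count(step, total_steps, n_layers)
    return "pixels" if k == 0 else f"latent@layer{k - 1}"
```

With 12 layers and 10,000 steps, for example, this sketch starts with a pure pixel target and ends predicting the latent output of layer 10 in the second half of training.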
Problem

Research questions and friction points this paper is trying to address.

Prevents representation collapse in self-supervised learning
Accelerates masked-autoencoding training via layer freezing
Enables scalable latent prediction for large vision models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive layer freezing schedule
Latent prediction without collapse
Scalable to 4B parameter models
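To illustrate why latent prediction against frozen layers avoids collapse, here is a toy NumPy example (not the paper's implementation; the linear "layers", dimensions, and learning rate are all invented). The targets come from a frozen layer, so they cannot drift toward a degenerate constant while the predictor trains, which is the usual route to representation collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy: a frozen "shallow layer" W_frozen defines the latent
# targets; only the predictor W_pred receives gradient updates. Because
# the target network is never updated, the targets keep their variance
# instead of collapsing to a trivial constant.
W_frozen = rng.normal(size=(8, 8))   # frozen once its layer has converged
W_pred = np.zeros((8, 8))            # trainable latent predictor

x = rng.normal(size=(64, 8))         # toy stand-in for masked-token features
target = x @ W_frozen.T              # latent targets, fixed during training

lr = 0.05
for _ in range(500):
    pred = x @ W_pred.T
    grad = 2 * (pred - target).T @ x / len(x)  # MSE gradient w.r.t. W_pred
    W_pred -= lr * grad

final_loss = np.mean((x @ W_pred.T - target) ** 2)
```

The predictor converges to the fixed targets, and the targets retain non-trivial variance throughout, since no gradient ever reaches `W_frozen`.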