Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively integrating generative and predictive paradigms in self-supervised learning while maintaining efficiency, stability, and multi-level semantic modeling. To this end, we propose Bootleg, a novel approach that introduces, for the first time, a multi-layer latent self-distillation mechanism: a student network jointly learns features across varying levels of abstraction by predicting the representations from multiple hidden layers of a momentum teacher network. By combining the strengths of both paradigms, Bootleg mitigates target instability and enhances high-level semantic representation learning. Implemented within a Transformer framework, Bootleg outperforms I-JEPA by up to 10% on the ImageNet-1K and iNaturalist-21 image classification benchmarks and achieves substantial performance gains on the ADE20K and Cityscapes semantic segmentation tasks.

📝 Abstract
The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
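The multi-layer objective described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation; all names, the tanh toy network, and the EMA momentum value are assumptions): a student network predicts the hidden-layer activations of a momentum teacher, the loss sums per-layer errors, and the teacher tracks the student via an exponential moving average.

```python
# Toy sketch of multi-layer latent self-distillation (names hypothetical).
import numpy as np

rng = np.random.default_rng(0)
DIM, LAYERS = 8, 3

# Student and teacher share an architecture: a stack of linear layers.
student = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(LAYERS)]
teacher = [w.copy() for w in student]  # teacher starts as a copy of the student

def hidden_states(weights, x):
    """Return the activation after every layer (toy tanh nonlinearity)."""
    states, h = [], x
    for w in weights:
        h = np.tanh(h @ w)
        states.append(h)
    return states

def multilayer_distill_loss(x):
    """Sum of per-layer MSEs between student and teacher hidden states.

    In practice the teacher's activations would be treated as fixed
    targets (no gradient flows through them).
    """
    s_states = hidden_states(student, x)
    t_states = hidden_states(teacher, x)
    return sum(np.mean((s - t) ** 2) for s, t in zip(s_states, t_states))

def ema_update(momentum=0.99):
    """Teacher weights track the student via an exponential moving average."""
    for t, s in zip(teacher, student):
        t *= momentum
        t += (1.0 - momentum) * s

x = rng.normal(size=(4, DIM))
loss_before = multilayer_distill_loss(x)  # teacher == student, so loss is 0
student[0] += 0.05                        # stand-in for a gradient step
loss_after = multilayer_distill_loss(x)   # mismatch now appears at every layer
ema_update()                              # teacher drifts slowly toward student
```

The key contrast with final-layer self-distillation is that `multilayer_distill_loss` supervises every depth, so shallow layers receive direct low-level targets while deep layers receive abstract ones, which is the mechanism the abstract credits for stabilizing training.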
Problem

Research questions and friction points this paper is trying to address.

self-supervised learning
generative methods
predictive methods
training instability
high-level features
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation
hidden layers
self-supervised learning
hierarchical representation
Bootleg