🤖 AI Summary
Vision transformers (ViTs) pre-trained as masked autoencoders (MAEs) perform well in low-label regimes, but the computational cost of pre-training hinders industrial deployment. This work integrates **Decorrelated Backpropagation (DBP)** into MAE pre-training. DBP is an optimization method that iteratively reduces correlations between the inputs of each layer to speed up convergence, and here it is applied only to the MAE encoder, which accelerates pre-training without loss of stability. With ImageNet-1K pre-training and ADE20K fine-tuning, DBP-MAE reaches baseline performance in 21.1% less wall-clock time, lowers carbon emissions by 21.4%, and improves semantic segmentation mIoU by 1.1 points. Similar gains are reported when pre-training and fine-tuning on proprietary industrial data, indicating that DBP can cut training time and energy use for large-scale ViT pre-training while improving downstream performance.
📝 Abstract
Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP), an optimization method that iteratively reduces input correlations at each layer to accelerate convergence, into MAE pre-training. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
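To make "iteratively reduces input correlations at each layer" concrete, here is a minimal sketch of one possible decorrelation update on toy data. The abstract does not state the update rule, so the rule below is an assumption: a decorrelating matrix R is applied to a layer's input x, and R is nudged to shrink the off-diagonal entries of the second-moment matrix of z = R x. All names (`decorrelation_step`, `run_demo`, the learning rate, the toy data) are hypothetical and are not taken from the paper.

```python
import random


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def second_moment(z):
    """Batch estimate of E[z z^T] for z given as samples x features."""
    n = len(z)
    return [[v / n for v in row] for row in matmul(transpose(z), z)]


def mean_abs_offdiag(c):
    """Average magnitude of the off-diagonal entries of a square matrix."""
    d = len(c)
    total = sum(abs(c[i][j]) for i in range(d) for j in range(d) if i != j)
    return total / (d * (d - 1))


def decorrelation_step(r, x, lr):
    """Assumed update: R <- R - lr * offdiag(C) @ R, with C = E[z z^T], z = R x."""
    z = matmul(x, transpose(r))  # each row of z is R applied to one sample
    c = second_moment(z)
    d = len(c)
    off = [[c[i][j] if i != j else 0.0 for j in range(d)] for i in range(d)]
    step = matmul(off, r)
    return [[r[i][j] - lr * step[i][j] for j in range(d)] for i in range(d)]


def run_demo(seed=0, n=200, steps=200, lr=0.1):
    """Decorrelate a toy batch of correlated 3-d inputs; return correlation before/after."""
    rng = random.Random(seed)
    x = []
    for _ in range(n):
        a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        x.append([a, 0.5 * a + b, 0.5 * a + 0.5 * b + c])  # features share components
    r = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]  # start at identity
    before = mean_abs_offdiag(second_moment(x))
    for _ in range(steps):
        r = decorrelation_step(r, x, lr)
    after = mean_abs_offdiag(second_moment(matmul(x, transpose(r))))
    return before, after


if __name__ == "__main__":
    before, after = run_demo()
    print(f"mean |off-diagonal| before: {before:.4f}, after: {after:.2e}")
```

In an actual network the decorrelated input z, rather than x, would be fed to the layer's weights, and the update would run on mini-batches alongside ordinary backpropagation. Under the selective scheme described above, such a transform would be attached only to encoder layers.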