Decorrelation Speeds Up Vision Transformers

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) pretrained with Masked Autoencoders (MAEs) achieve strong performance in low-label regimes, yet their high computational cost hinders industrial deployment. To address this, the paper integrates **Decorrelated Backpropagation (DBP)** into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the MAE encoder, DBP speeds up pre-training while preserving stability. On ImageNet-1K, DBP reduces wall-clock pre-training time to baseline performance by 21.1% and carbon emissions by 21.4%; on downstream ADE20K semantic segmentation it improves mIoU by 1.1 points, with consistent gains on proprietary industrial datasets. Notably, DBP slots into the MAE training framework without modifying the model architecture or loss function, enabling efficient, low-carbon, and high-performance ViT pre-training.

📝 Abstract
Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
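The core mechanism described in the abstract, iteratively reducing input correlations at each layer, can be sketched in a few lines. The snippet below is a minimal NumPy illustration of a generic layer-wise decorrelation update, not the authors' exact implementation: the update rule, learning rate, and function name `decorrelation_step` are assumptions for illustration.

```python
import numpy as np

def decorrelation_step(R, X, lr=1e-3):
    """One update of a layer's decorrelation matrix R.

    X: (batch, features) raw layer inputs; R: (features, features).
    The decorrelated inputs Z = X @ R.T should have their off-diagonal
    covariance driven toward zero. A common formulation (assumed here)
    updates R by gradient descent on the off-diagonal covariance:
        R <- R - lr * (C - diag(C)) @ R
    where C is the batch covariance of the decorrelated inputs.
    """
    Z = X @ R.T
    C = (Z.T @ Z) / len(Z)           # batch covariance of decorrelated inputs
    off = C - np.diag(np.diag(C))    # keep only off-diagonal correlations
    return R - lr * off @ R
```

In the full method, each encoder layer maintains such a matrix `R` and feeds the forward pass the decorrelated activations `Z`; applying this only in the encoder (not the decoder) is what the paper reports as the stable configuration.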
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs of ViT pre-training in resource-limited settings
Accelerating MAE convergence while maintaining model stability
Decreasing training time and carbon emissions for vision transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Decorrelated Backpropagation to accelerate convergence
Selectively applies DBP to encoder for stable training
Reduces training time and energy while improving performance
Kieran Carrigg
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Rob van Gastel
AI & Vision Team, ASMPT ALSI B.V., Platinawerf 20, 6641 TL Beuningen, The Netherlands
Melda Yeghaian
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Sander Dalm
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Faysal Boughorbel
AI & Vision Team, ASMPT ALSI B.V., Platinawerf 20, 6641 TL Beuningen, The Netherlands
Marcel van Gerven
Professor of Artificial Cognitive Systems, Donders Institute for Brain, Cognition and Behaviour
Artificial Intelligence · Machine Learning · Computational Neuroscience