🤖 AI Summary
To address the rapid growth in parameters and computational cost when depth-scaling pretrained Vision Transformers (ViTs), this paper proposes ScaleNet, an efficient depth-scaling method. ScaleNet extends ViT depth by inserting additional layers that share their weights with corresponding pretrained layers, while lightweight learnable adapter modules attached in parallel keep each shared instance distinct. Its core contribution is mitigating the performance degradation induced by depth expansion at negligible parameter overhead. On ImageNet-1K, a 2× depth-scaled DeiT-Base model achieves a 7.42% top-1 accuracy improvement over training from scratch while requiring only one-third of the training epochs. Generalization is further validated on a downstream object detection task. ScaleNet thus offers a low-cost path to substantially deeper ViTs without architectural redesign or full retraining.
📝 Abstract
Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by validation on an object detection task.
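The core mechanism described above (duplicating pretrained layers by reference rather than by copy, then giving each duplicate its own small parallel adapter) can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: each "layer" is reduced to a single weight matrix, and the adjustment parameters are modeled as a low-rank adapter whose output is added in parallel to the shared path.

```python
import numpy as np

rng = np.random.default_rng(0)


class SharedLayer:
    """Stand-in for one transformer layer: a shared pretrained weight
    plus a per-instance parallel low-rank adapter (hypothetical sketch)."""

    def __init__(self, W, dim, rank=4):
        self.W = W  # shared pretrained weight tensor (held by reference, not copied)
        # Per-instance adjustment parameters; the up-projection B starts at
        # zero so a freshly inserted layer initially behaves like its
        # pretrained counterpart.
        self.A = rng.normal(0.0, 0.02, (dim, rank))  # adapter down-projection
        self.B = np.zeros((rank, dim))               # adapter up-projection

    def __call__(self, x):
        # Shared path plus parallel adapter path.
        return x @ self.W + (x @ self.A) @ self.B


dim = 8
# Two "pretrained" layer weights.
pretrained = [rng.normal(0.0, 0.02, (dim, dim)) for _ in range(2)]

# 2x depth scaling: each pretrained weight backs two layer instances.
# Only the adapters add new parameters, so the overhead is small.
scaled = [SharedLayer(W, dim) for W in pretrained for _ in range(2)]

x = rng.normal(size=(1, dim))
for layer in scaled:
    x = layer(x)

# The duplicated layers really do share one parameter tensor...
assert scaled[0].W is scaled[1].W
# ...while each instance owns its own adjustment parameters.
assert scaled[0].A is not scaled[1].A
```

In a real training run, only the adapter matrices (and any other small adjustment parameters) would be updated, which is what keeps the cost well below retraining the expanded model from scratch.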