ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the rapid growth in parameters and computational cost when scaling up the depth of pretrained Vision Transformers (ViTs), this paper proposes ScaleNet, an efficient depth-scaling method that builds on existing pretrained models rather than retraining from scratch. ScaleNet extends ViT depth by inserting additional layers that share their weight tensors with corresponding pretrained layers, while lightweight learnable adjustment modules (parallel, LoRA-style adapters) keep each shared instance functionally distinct. Its core contribution is mitigating the performance degradation induced by depth expansion at negligible parameter overhead. On ImageNet-1K, a 2× depth-scaled DeiT-Base model achieves a 7.42% top-1 accuracy improvement over training from scratch while requiring only one-third of the training epochs. Generalization is further validated on downstream object detection. ScaleNet thus offers a low-cost, high-performance path to ViT depth scaling without architectural redesign or full retraining.

📝 Abstract
Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by validation on an object detection task.
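The mechanism the abstract describes, inserted layers that reuse a pretrained weight tensor plus a small parallel adapter so each shared instance stays distinct, can be sketched as follows. This is a minimal illustration with hypothetical names (`SharedLayerWithAdapter`, `scale_depth`), not the paper's actual implementation; a real ViT layer would involve attention and MLP blocks, here reduced to a single linear projection:

```python
import numpy as np


class SharedLayerWithAdapter:
    """One expanded layer: reuses a pretrained weight tensor (not copied)
    and adds a small parallel low-rank adapter as its per-layer
    adjustment parameters."""

    def __init__(self, shared_weight, rank=4, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = shared_weight.shape
        self.W = shared_weight  # shared with the pretrained layer
        # Low-rank adapter: down-projection is small random, up-projection
        # is zero-initialized so the adapter starts as a no-op.
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))

    def forward(self, x):
        # Shared main path plus parallel adapter path.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T


def scale_depth(pretrained_weights, factor=2, rank=4):
    """Expand depth by `factor`: each pretrained weight tensor backs
    `factor` layers, each with its own adapter."""
    return [
        SharedLayerWithAdapter(W, rank=rank)
        for W in pretrained_weights
        for _ in range(factor)
    ]
```

With zero-initialized up-projections, the expanded model initially computes the same per-layer function as the shared pretrained weights; only the tiny adapter matrices (rank × width per layer) are new parameters, which is why the parameter overhead stays negligible.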
Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling vision transformers with minimal parameter increases
Reducing computational costs of training large ViT models
Maintaining performance while expanding model depth through weight sharing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling pretrained ViTs with incremental parameter insertion
Using layer-wise weight sharing for parameter efficiency
Employing parallel adapter modules for performance optimization
Zhiwei Hao
Beijing Institute of Technology
computer vision, efficient deep learning
Jianyuan Guo
City University of Hong Kong (CityU)
Li Shen
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 510275, China
Kai Han
Huawei Noah's Ark Lab, Beijing 100084, China
Yehui Tang
Shanghai Jiao Tong University
Machine Learning, Quantum AI & AI4Science
Han Hu
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Yunhe Wang
Noah's Ark Lab, Huawei Technologies
Deep Learning, Language Model, Machine Learning, Computer Vision