Unity is Power: Semi-Asynchronous Collaborative Training of Large-Scale Models with Structured Pruning in Resource-Limited Clients

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large-scale collaborative model training across resource-constrained, heterogeneous edge devices faces challenges including data decentralization, computational heterogeneity, knowledge loss due to unstructured pruning, and straggler-induced computation bottlenecks. To address these, we propose Co-S²P, a semi-asynchronous collaborative training framework that innovatively integrates data-distribution-aware structured pruning with cross-module knowledge distillation. Co-S²P enables resource-adaptive submodel generation and semi-asynchronous parameter updates, and provides an asymptotically optimal convergence rate of O(1/√(N·E·Q)), where N, E, and Q denote the number of devices, local epochs, and pruning granularity, respectively. Evaluated on 16 NVIDIA Jetson devices, Co-S²P achieves up to 8.8% higher accuracy, 22% lower memory footprint, 24% faster training time, and 1.2× improved resource utilization compared to state-of-the-art baselines.
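The paper does not release code with this summary; the semi-asynchronous update idea can be sketched as follows. This is a minimal, illustrative aggregation step in which the server merges whichever client updates have arrived, discounting stale (straggler) updates so fresh submodels count more. The function name, the dict-based parameter representation, and the `1/(1+staleness)^alpha` discount rule are assumptions for illustration, not Co-S²P's exact method.

```python
# Minimal sketch of one semi-asynchronous aggregation round (illustrative only;
# the staleness-discount rule and names are assumptions, not the paper's method).

def aggregate(global_model, updates, staleness, alpha=0.5):
    """Merge client parameter deltas into the global model, weighting each
    client's delta by (1 + staleness)^-alpha so straggler updates count less.

    global_model: {param_name: value}
    updates:      {client_id: {param_name: delta}}  (possibly pruned submodels)
    staleness:    {client_id: rounds since the client last synced}
    """
    merged = {k: 0.0 for k in global_model}
    total_weight = 0.0
    for client_id, delta in updates.items():
        w = (1.0 + staleness[client_id]) ** -alpha  # fresher => larger weight
        total_weight += w
        for k, v in delta.items():
            merged[k] += w * v
    # Apply the weighted-average delta to the global parameters.
    return {k: global_model[k] + merged[k] / total_weight for k in global_model}
```

Because clients hold pruned submodels, each `delta` may cover only a subset of the global parameters; untouched parameters simply receive a zero update in this sketch.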

📝 Abstract
In this work, we study how to unleash the potential of massive heterogeneous weak computing power to collaboratively train large-scale models on dispersed datasets. To improve both efficiency and accuracy in resource-adaptive collaborative learning, we take the first step toward addressing the unstructured pruning, varying submodel architectures, knowledge loss, and straggler challenges simultaneously. We propose a novel semi-asynchronous collaborative training framework, namely Co-S²P, with data-distribution-aware structured pruning and a cross-block knowledge transfer mechanism to address the above concerns. Furthermore, we provide theoretical proof that Co-S²P achieves an asymptotically optimal convergence rate of O(1/√(N·E·Q)). Finally, we conduct extensive experiments on a real-world hardware testbed, in which 16 heterogeneous Jetson devices are united to train large-scale models with up to 0.11 billion parameters. The experimental results demonstrate that Co-S²P improves accuracy by up to 8.8% and resource utilization by up to 1.2× compared to state-of-the-art methods, while reducing memory consumption by approximately 22% and training time by about 24% on all resource-limited devices.
Problem

Research questions and friction points this paper is trying to address.

Training large models on resource-limited heterogeneous devices
Addressing unstructured pruning and varying submodel architectures
Mitigating knowledge loss and straggler issues collaboratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-asynchronous collaborative training framework for efficiency
Data distribution-aware structured pruning to reduce resource use
Cross-block knowledge transfer mechanism to mitigate knowledge loss
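The structured-pruning contribution above can be illustrated with a small sketch: each client keeps only the highest-importance whole structures (e.g. channels or attention heads) so its submodel fits its compute budget. The importance scores and the top-k selection rule here are assumptions for illustration; the paper's data-distribution-aware criterion is more involved.

```python
import numpy as np

# Illustrative resource-adaptive structured pruning (assumed scoring rule;
# Co-S^2P's data-distribution-aware criterion is more sophisticated).

def keep_mask(channel_scores, capacity_ratio):
    """Return a boolean mask keeping the top `capacity_ratio` fraction of
    whole channels, ranked by importance score, so the resulting submodel
    matches the client's resource budget."""
    scores = np.asarray(channel_scores, dtype=float)
    n_keep = max(1, int(round(capacity_ratio * scores.size)))
    keep = np.argsort(scores)[::-1][:n_keep]  # indices of top channels
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    return mask
```

Pruning whole channels (rather than individual weights, as in unstructured pruning) keeps the submodel dense and hardware-friendly on edge devices, which is the motivation the bullets above give for structured pruning.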
👥 Authors
Yan Li
Shandong University, Qingdao 266237, China
Mingyi Li
Shandong University, Qingdao 266237, China
Xiao Zhang
Shandong University, Qingdao 266237, China
Guangwei Xu
Alibaba Group (NLP)
Feng Chen
Shandong University, Qingdao 266237, China
Yuan Yuan
Shandong University, Qingdao 266237, China
Yifei Zou
Shandong University
Mengying Zhao
Shandong University (embedded systems)
Jianbo Lu
Shandong University, Qingdao 266237, China
Dongxiao Yu
Professor of Computer Science, Shandong University (Distributed Computing, Wireless Networking, Graph Algorithms)