Efficient Deployment of Large Language Models on Resource-constrained Devices

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address resource constraints—particularly limited memory, computation, and communication—on edge devices for fine-tuning and inference of large language models (LLMs), this paper proposes FedSpine, a novel federated learning framework. FedSpine is the first to jointly integrate parameter-efficient fine-tuning (LoRA) with structured pruning, and introduces an iterative pruning-fine-tuning co-optimization mechanism. Crucially, it incorporates a prior-free online multi-armed bandit algorithm to adaptively assign pruning ratios and LoRA ranks across heterogeneous edge devices, effectively handling data non-IIDness while preserving privacy. Extensive experiments on an 80-node physical testbed demonstrate that FedSpine accelerates fine-tuning by 1.4×–6.9×, improves accuracy by 0.4%–4.5% at equivalent sparsity levels, and significantly reduces communication overhead, memory footprint, and computational cost compared to state-of-the-art baselines.
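The summary above hinges on a prior-free online multi-armed bandit that assigns each device a pruning ratio and LoRA rank without knowing its compute or communication capabilities in advance. The sketch below illustrates that idea with a standard UCB1 bandit per device; the arm grid, reward function, and exploration constant are illustrative assumptions, not FedSpine's actual formulation.

```python
# Minimal UCB1 sketch of a server adaptively picking a (pruning ratio, LoRA rank)
# configuration per device. The reward model and arm grid are illustrative only.
import math
import random

ARMS = [(ratio, rank) for ratio in (0.1, 0.3, 0.5) for rank in (4, 8, 16)]

class DeviceBandit:
    """One UCB1 bandit per device; needs no prior knowledge of its capabilities."""
    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.values = [0.0] * num_arms  # running mean reward per arm
        self.total = 0

    def select(self):
        # Play each arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        ucb = [self.values[a] + math.sqrt(2 * math.log(self.total) / self.counts[a])
               for a in range(len(self.counts))]
        return max(range(len(self.counts)), key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def simulated_reward(ratio, rank, device_speed):
    """Placeholder reward: higher rank / lighter pruning helps accuracy,
    but costs more time on slow devices."""
    accuracy_gain = (1 - ratio) * 0.5 + min(rank, 16) / 32
    latency_cost = (1 - ratio) * rank / (device_speed * 16)
    return accuracy_gain - latency_cost + random.gauss(0, 0.02)

if __name__ == "__main__":
    devices = {f"dev{i}": DeviceBandit(len(ARMS)) for i in range(4)}
    speeds = {name: random.uniform(0.5, 2.0) for name in devices}
    for _ in range(200):  # federated fine-tuning rounds
        for name, bandit in devices.items():
            arm = bandit.select()
            ratio, rank = ARMS[arm]
            bandit.update(arm, simulated_reward(ratio, rank, speeds[name]))
    for name, bandit in devices.items():
        best = max(range(len(ARMS)), key=lambda a: bandit.values[a])
        print(name, "-> pruning ratio %.1f, LoRA rank %d" % ARMS[best])
```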

📝 Abstract
Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4×–6.9× and improve final accuracy by 0.4%–4.5% under the same sparsity level compared to other baselines.
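To make the iterative prune-and-tune process in the abstract concrete, here is a minimal NumPy sketch of one round for a single linear layer: each device structurally prunes output neurons at its assigned ratio, fine-tunes only its LoRA factors on local data, and the server averages the resulting low-rank updates. The toy regression objective, shapes, and learning rate are assumptions for illustration, not the paper's exact procedure.

```python
# One FedSpine-style round for a single linear layer, sketched with NumPy:
# structured pruning of output neurons, LoRA-only local fine-tuning, and
# server-side averaging of the low-rank updates. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT = 32, 16
W_base = rng.normal(size=(D_OUT, D_IN))           # frozen pretrained weight

def structured_prune(ratio, weight):
    """Keep the (1 - ratio) fraction of output neurons with largest L2 norm."""
    keep = int(round(weight.shape[0] * (1 - ratio)))
    norms = np.linalg.norm(weight, axis=1)
    mask = np.zeros(weight.shape[0])
    mask[np.argsort(norms)[-keep:]] = 1.0
    return mask                                    # 1 = kept neuron, 0 = pruned

def local_update(ratio, rank, steps=50, lr=1e-2):
    """Prune, then fine-tune only LoRA factors A, B on a device-local toy task."""
    mask = structured_prune(ratio, W_base)
    A = rng.normal(scale=0.01, size=(rank, D_IN))
    B = np.zeros((D_OUT, rank))
    X = rng.normal(size=(64, D_IN))                # device-local data
    Y = rng.normal(size=(64, D_OUT))
    for _ in range(steps):
        W_eff = (W_base + B @ A) * mask[:, None]   # pruned effective weight
        err = X @ W_eff.T - Y                      # residual on local data
        grad_W = (err.T @ X / len(X)) * mask[:, None]
        grad_B, grad_A = grad_W @ A.T, B.T @ grad_W
        B -= lr * grad_B                           # update only LoRA factors,
        A -= lr * grad_A                           # the base weight stays frozen
    return B @ A                                   # low-rank weight update

# Server: average the LoRA updates from heterogeneous devices.
device_configs = [(0.3, 8), (0.5, 4), (0.1, 16)]   # (pruning ratio, LoRA rank)
updates = [local_update(r, k) for r, k in device_configs]
W_global = W_base + np.mean(updates, axis=0)
print("aggregated update norm:", np.linalg.norm(W_global - W_base))
```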
Problem

Research questions and friction points this paper is trying to address.

Multi-task Processing
Data Imbalance
Privacy Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FedSpine
Parameter Reduction
Adaptive Model Optimization
Authors

Zhiwei Yao
University of Science and Technology of China
Edge Computing, Federated Learning

Yang Xu
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China

Hongli Xu
University of Science and Technology of China
Software Defined Network, Cooperative Communication, Sensor Networks

Yunming Liao
University of Science and Technology of China
Edge Intelligence, Edge Computing, Federated Learning, Split Federated Learning

Zuan Xie
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China