Efficient Deployment of Large Language Models on Resource-constrained Devices

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address resource constraints—particularly limited memory, computation, and communication—on edge devices for fine-tuning and inference of large language models (LLMs), this paper proposes FedSpine, a novel federated learning framework. FedSpine is the first to jointly integrate parameter-efficient fine-tuning (LoRA) with structured pruning, and introduces an iterative pruning-fine-tuning co-optimization mechanism. Crucially, it incorporates a prior-free online multi-armed bandit algorithm to adaptively assign pruning ratios and LoRA ranks across heterogeneous edge devices, effectively handling data non-IIDness while preserving privacy. Extensive experiments on an 80-node physical testbed demonstrate that FedSpine accelerates fine-tuning by 1.4×–6.9×, improves accuracy by 0.4%–4.5% at equivalent sparsity levels, and significantly reduces communication overhead, memory footprint, and computational cost compared to state-of-the-art baselines.
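The summary above hinges on a prior-free online multi-armed bandit that assigns each device a pruning ratio and LoRA rank without knowing its compute or communication capabilities in advance. The sketch below illustrates that idea with a standard UCB1 bandit per device; the arm grid, reward function, and exploration constant are illustrative assumptions, not FedSpine's actual formulation.

```python
# Minimal UCB1 sketch of a server adaptively picking a (pruning ratio, LoRA rank)
# configuration per device. The reward model and arm grid are illustrative only.
import math
import random

ARMS = [(ratio, rank) for ratio in (0.1, 0.3, 0.5) for rank in (4, 8, 16)]

class DeviceBandit:
    """One UCB1 bandit per device; needs no prior knowledge of its capabilities."""
    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.values = [0.0] * num_arms  # running mean reward per arm
        self.total = 0

    def select(self):
        # Play each arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        ucb = [self.values[a] + math.sqrt(2 * math.log(self.total) / self.counts[a])
               for a in range(len(self.counts))]
        return max(range(len(self.counts)), key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def simulated_reward(ratio, rank, device_speed):
    """Placeholder reward: higher rank / lighter pruning helps accuracy,
    but costs more time on slow devices."""
    accuracy_gain = (1 - ratio) * 0.5 + min(rank, 16) / 32
    latency_cost = (1 - ratio) * rank / (device_speed * 16)
    return accuracy_gain - latency_cost + random.gauss(0, 0.02)

if __name__ == "__main__":
    devices = {f"dev{i}": DeviceBandit(len(ARMS)) for i in range(4)}
    speeds = {name: random.uniform(0.5, 2.0) for name in devices}
    for _ in range(200):  # federated fine-tuning rounds
        for name, bandit in devices.items():
            arm = bandit.select()
            ratio, rank = ARMS[arm]
            bandit.update(arm, simulated_reward(ratio, rank, speeds[name]))
    for name, bandit in devices.items():
        best = max(range(len(ARMS)), key=lambda a: bandit.values[a])
        print(name, "-> pruning ratio %.1f, LoRA rank %d" % ARMS[best])
```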

📝 Abstract
Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4×–6.9× and improve final accuracy by 0.4%–4.5% under the same sparsity level compared to other baselines.
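To make the iterative prune-and-tune process in the abstract concrete, here is a minimal NumPy sketch of one round for a single linear layer: each device structurally prunes output neurons at its assigned ratio, fine-tunes only its LoRA factors on local data, and the server averages the resulting low-rank updates. The toy regression objective, shapes, and learning rate are assumptions for illustration, not the paper's exact procedure.

```python
# One FedSpine-style round for a single linear layer, sketched with NumPy:
# structured pruning of output neurons, LoRA-only local fine-tuning, and
# server-side averaging of the low-rank updates. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT = 32, 16
W_base = rng.normal(size=(D_OUT, D_IN))           # frozen pretrained weight

def structured_prune(ratio, weight):
    """Keep the (1 - ratio) fraction of output neurons with largest L2 norm."""
    keep = int(round(weight.shape[0] * (1 - ratio)))
    norms = np.linalg.norm(weight, axis=1)
    mask = np.zeros(weight.shape[0])
    mask[np.argsort(norms)[-keep:]] = 1.0
    return mask                                    # 1 = kept neuron, 0 = pruned

def local_update(ratio, rank, steps=50, lr=1e-2):
    """Prune, then fine-tune only LoRA factors A, B on a device-local toy task."""
    mask = structured_prune(ratio, W_base)
    A = rng.normal(scale=0.01, size=(rank, D_IN))
    B = np.zeros((D_OUT, rank))
    X = rng.normal(size=(64, D_IN))                # device-local data
    Y = rng.normal(size=(64, D_OUT))
    for _ in range(steps):
        W_eff = (W_base + B @ A) * mask[:, None]   # pruned effective weight
        err = X @ W_eff.T - Y                      # residual on local data
        grad_W = (err.T @ X / len(X)) * mask[:, None]
        grad_B, grad_A = grad_W @ A.T, B.T @ grad_W
        B -= lr * grad_B                           # update only LoRA factors,
        A -= lr * grad_A                           # the base weight stays frozen
    return B @ A                                   # low-rank weight update

# Server: average the LoRA updates from heterogeneous devices.
device_configs = [(0.3, 8), (0.5, 4), (0.1, 16)]   # (pruning ratio, LoRA rank)
updates = [local_update(r, k) for r, k in device_configs]
W_global = W_base + np.mean(updates, axis=0)
print("aggregated update norm:", np.linalg.norm(W_global - W_base))
```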
Problem

Research questions and friction points this paper is trying to address.

Multi-task Processing
Data Imbalance
Privacy Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FedSpine
Parameter Reduction
Adaptive Model Optimization
Authors

Zhiwei Yao
University of Science and Technology of China
Edge Computing, Federated Learning

Yang Xu
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China

Hongli Xu
University of Science and Technology of China
Software Defined Network, Cooperative Communication, Sensor Networks

Yunming Liao
University of Science and Technology of China
Edge Intelligence, Edge Computing, Federated Learning, Split Federated Learning

Zuan Xie
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China