🤖 AI Summary
To address the challenge of fine-tuning large language models (LLMs) on memory- and compute-constrained edge devices, this paper proposes an efficient on-device fine-tuning method that requires no modifications to the inference engine. The approach comprises three key contributions: (1) Parallelized Randomized Gradient Estimation (P-RGE), a low-overhead gradient approximation technique operating within a zeroth-order optimization framework; (2) a lightweight LoRA-FA module fully compatible with the ExecuTorch runtime, requiring no intrusive changes to the execution stack; and (3) the synergistic integration of LoRA-based parameter-efficient fine-tuning with P-RGE, achieving up to a 68% reduction in GPU memory consumption and significantly lower computational overhead. Experiments demonstrate that the method maintains fine-tuning accuracy while accelerating training by 3.2×, enabling real-time, personalized LLM deployment on edge devices. This work provides a practical pathway for continual learning of LLMs in resource-constrained environments.
📝 Abstract
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user- or task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of the multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g., LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
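To make the core idea concrete, the following is a minimal sketch of randomized gradient estimation (RGE), the zeroth-order building block the abstract describes: the gradient is approximated from forward passes alone by averaging `q` two-point finite differences along random Gaussian directions. The toy quadratic `loss`, the function names, and all hyperparameter values here are illustrative assumptions, not the paper's implementation; the queries are also run sequentially for clarity, whereas P-RGE batches them into parallel forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy differentiable loss standing in for an LLM forward pass:
# L(theta) = ||A @ theta - b||^2. Only function *values* are queried,
# matching the ZO setting where backpropagation is unavailable.
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)

def loss(theta):
    r = A @ theta - b
    return float(r @ r)

def rge_grad(theta, q=16, eps=1e-3):
    """Randomized gradient estimation: average q two-point central
    differences along random directions z ~ N(0, I).
    Each query costs two forward passes; in P-RGE these q queries
    would be executed in parallel rather than in this loop."""
    g = np.zeros_like(theta)
    for _ in range(q):
        z = rng.normal(size=theta.shape)
        g += (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps) * z
    return g / q

theta = rng.normal(size=4)
true_grad = 2 * A.T @ (A @ theta - b)   # analytic gradient, for comparison only
est_grad = rge_grad(theta, q=256)

# One ZO-SGD step using only forward evaluations.
lr = 1e-2
theta_new = theta - lr * est_grad
```

In a parameter-efficient variant such as the paper's LoRA integration, `theta` would be only the small set of trainable adapter parameters, which shrinks both the perturbation cost and the memory footprint of each query.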