🤖 AI Summary
To address the challenge of fine-tuning large language models (LLMs) on memory- and compute-constrained edge devices, this paper proposes an efficient on-device fine-tuning method that requires no modifications to the inference engine. The approach comprises three key contributions: (1) Parallelized Randomized Gradient Estimation (P-RGE), a low-overhead gradient approximation technique operating within a zeroth-order optimization framework; (2) a lightweight LoRA-FA module fully compatible with the ExecuTorch runtime, requiring no intrusive changes to the execution stack; and (3) the synergistic integration of LoRA-based parameter-efficient fine-tuning with P-RGE, achieving up to a 68% reduction in GPU memory consumption and significantly lower computational overhead. Experiments demonstrate that the method maintains fine-tuning accuracy while accelerating training by 3.2×, enabling real-time, personalized LLM deployment on edge devices. This work provides a practical pathway for continual learning of LLMs in resource-constrained environments.
📝 Abstract
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user- or task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of the multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g., LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
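To make the core idea concrete, the following is a minimal sketch of randomized gradient estimation (RGE), the zeroth-order building block the abstract describes: the gradient is approximated from forward passes alone by averaging `q` two-point finite differences along random Gaussian directions. The toy quadratic `loss`, the function names, and all hyperparameter values here are illustrative assumptions, not the paper's implementation; the queries are also run sequentially for clarity, whereas P-RGE batches them into parallel forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy differentiable loss standing in for an LLM forward pass:
# L(theta) = ||A @ theta - b||^2. Only function *values* are queried,
# matching the ZO setting where backpropagation is unavailable.
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)

def loss(theta):
    r = A @ theta - b
    return float(r @ r)

def rge_grad(theta, q=16, eps=1e-3):
    """Randomized gradient estimation: average q two-point central
    differences along random directions z ~ N(0, I).
    Each query costs two forward passes; in P-RGE these q queries
    would be executed in parallel rather than in this loop."""
    g = np.zeros_like(theta)
    for _ in range(q):
        z = rng.normal(size=theta.shape)
        g += (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps) * z
    return g / q

theta = rng.normal(size=4)
true_grad = 2 * A.T @ (A @ theta - b)   # analytic gradient, for comparison only
est_grad = rge_grad(theta, q=256)

# One ZO-SGD step using only forward evaluations.
lr = 1e-2
theta_new = theta - lr * est_grad
```

In a parameter-efficient variant such as the paper's LoRA integration, `theta` would be only the small set of trainable adapter parameters, which shrinks both the perturbation cost and the memory footprint of each query.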