🤖 AI Summary
Deploying large language models (LLMs) on multi-tenant edge devices faces three core challenges: complex adapter selection, high memory overhead from adapter swapping, and low utilization of computational resources, all of which increase latency. To address them, this paper proposes EdgeLoRA, a framework that integrates adaptive adapter selection, heterogeneous memory management, and batched LoRA inference scheduling. Its key innovations include a lightweight adapter caching and pooling mechanism that enables dynamic adapter loading/unloading, cross-request LoRA weight sharing, and fine-grained memory reuse. Experimental evaluation on Llama3.1-8B shows that EdgeLoRA achieves up to 4× higher throughput than the llama.cpp baseline and supports 2–3 orders of magnitude more concurrent adapters per device. These improvements increase the efficiency, scalability, and resource utilization of LLM serving at the edge.
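The adapter caching and pooling idea can be pictured with a small sketch. The summary does not say which eviction policy EdgeLoRA uses, so the least-recently-used (LRU) policy below is an assumption, and `AdapterCache` and `load_fn` are hypothetical names introduced only for illustration; this is not the paper's implementation.

```python
from collections import OrderedDict
from typing import Callable, List


class AdapterCache:
    """Fixed-capacity pool of resident adapters with LRU eviction.

    LRU is an assumed policy for illustration; the source does not
    specify how EdgeLoRA chooses which adapter to unload.
    """

    def __init__(self, capacity: int, load_fn: Callable[[str], List[float]]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical loader: maps an adapter id to its weights,
        # e.g. by reading them from disk into memory.
        self.load_fn = load_fn
        self._slots: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id: str) -> List[float]:
        """Return the adapter's weights, loading and evicting as needed."""
        if adapter_id in self._slots:
            # Cache hit: mark this adapter as most recently used.
            self._slots.move_to_end(adapter_id)
            self.hits += 1
            return self._slots[adapter_id]
        # Cache miss: free a slot first if the pool is full.
        self.misses += 1
        if len(self._slots) >= self.capacity:
            self._slots.popitem(last=False)  # drop least recently used
        weights = self.load_fn(adapter_id)
        self._slots[adapter_id] = weights
        return weights

    def resident(self) -> List[str]:
        """Adapter ids currently in memory, least recently used first."""
        return list(self._slots)
```

With a capacity of two, requesting adapters `a`, `b`, `a`, `c` in that order leaves `a` and `c` resident: the repeated request for `a` is served from memory, and `b` is the one unloaded to make room for `c`.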
📝 Abstract
Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to adapt efficiently to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of selecting adapters for different tasks and the memory overhead of frequent adapter swapping. Moreover, because multi-tenant settings produce many concurrent requests, processing them sequentially underutilizes computational resources and increases latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in both latency and throughput, achieving up to 4× higher throughput. It can also serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
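To make the idea of batch LoRA inference concrete, here is a minimal sketch under stated assumptions. It uses the standard LoRA formulation, in which a request's output is the base projection plus a scaled low-rank update, y = Wx + s·B(Ax). The abstract does not describe how EdgeLoRA implements batching, so the grouping strategy, the function `batched_lora_forward`, and all parameter names are hypothetical. Pure Python lists stand in for tensors, so the sketch shows the structure of the computation rather than its performance.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def batched_lora_forward(
    w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    inputs: List[Vector],
    adapter_ids: List[str],
    scale: float = 1.0,
) -> List[Vector]:
    """Apply one shared base weight and per-request LoRA adapters.

    adapters maps an adapter id to (A, B), with A of shape r x d_in
    and B of shape d_out x r. Request i uses adapters[adapter_ids[i]].
    """
    # Step 1: the base projection W x is shared by every request,
    # regardless of which adapter the request uses.
    outputs = [matvec(w, x) for x in inputs]

    # Step 2: group request indices by adapter so each adapter's
    # weights are fetched once per batch rather than once per request.
    groups: Dict[str, List[int]] = defaultdict(list)
    for i, adapter_id in enumerate(adapter_ids):
        groups[adapter_id].append(i)

    # Step 3: add the low-rank update scale * B (A x) to each request.
    for adapter_id, indices in groups.items():
        a, b = adapters[adapter_id]
        for i in indices:
            delta = matvec(b, matvec(a, inputs[i]))
            outputs[i] = [o + scale * d for o, d in zip(outputs[i], delta)]
    return outputs
```

The design point the sketch highlights is that requests targeting different adapters can still share a single pass through the base model weights; only the small rank-r update differs per request, which is what makes mixing tenants in one batch worthwhile compared with processing requests sequentially.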