EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on multi-tenant edge devices faces three core challenges: complex adapter selection, high memory overhead, and low computational resource utilization, all of which increase latency. This paper proposes EdgeLoRA, a framework integrating adaptive adapter selection, heterogeneous memory management, and batched LoRA inference scheduling. Its key innovations include a lightweight adapter caching and pooling mechanism that enables dynamic adapter loading and unloading, cross-request LoRA weight sharing, and fine-grained memory reuse. In evaluations on Llama3.1-8B, EdgeLoRA achieves up to 4× higher throughput than baseline approaches and supports two to three orders of magnitude more concurrent adapters per device. These improvements substantially enhance the efficiency, scalability, and resource utilization of LLM serving at the edge.
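The summary above mentions an adapter caching and pooling mechanism with dynamic loading and unloading. EdgeLoRA's actual heterogeneous memory manager is not detailed here; as a minimal sketch, an LRU-style cache (names and policy are illustrative assumptions, not the paper's implementation) conveys the basic load/evict idea:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache for LoRA adapters. Illustrative only; EdgeLoRA's
    heterogeneous memory management is more sophisticated than this."""

    def __init__(self, capacity, loader):
        self.capacity = capacity    # max adapters resident in memory
        self.loader = loader        # callable: adapter_id -> adapter weights
        self.cache = OrderedDict()  # adapter_id -> weights, in LRU order

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)  # mark as most recently used
            return self.cache[adapter_id]
        weights = self.loader(adapter_id)       # cache miss: load from storage
        self.cache[adapter_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights
```

Keeping hot adapters resident while evicting cold ones is what lets a device serve far more adapters than fit in memory at once.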

📝 Abstract
Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving multi-tenant LLMs on edge devices
Reducing adapter selection complexity and memory overhead
Improving computational latency and throughput in edge environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive adapter selection for streamlined configuration
Heterogeneous memory management to reduce overhead
Batch LoRA inference for lower latency
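Batch LoRA inference works because every request shares the same base weights; only the small low-rank update differs per request. The paper's kernel-level scheduling is not shown here, but a NumPy sketch (function name and shapes are assumptions for illustration) captures the core computation:

```python
import numpy as np

def batched_lora_forward(x, W, adapters, adapter_ids, alpha=1.0):
    """Batched forward pass: one shared base matmul for all requests,
    plus each request's own low-rank LoRA update (illustrative sketch).

    x:           (batch, d_in) activations, one row per request
    W:           (d_in, d_out) shared base weight
    adapters:    dict adapter_id -> (A, B), A is (d_in, r), B is (r, d_out)
    adapter_ids: length-batch list, one adapter id per request
    """
    out = x @ W                           # single shared matmul for the batch
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        out[i] += alpha * (x[i] @ A) @ B  # cheap per-request rank-r update
    return out
```

Because the base matmul dominates the cost and is computed once for the whole batch, requests using different adapters no longer need to be processed sequentially.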
👥 Authors
Zheyu Shen, Graduate Student of Electronic and Computer Engineering, University of Maryland
Yexiao He, University of Maryland
Ziyao Wang, University of Maryland, College Park
Yuning Zhang, University of Maryland, College Park
Guoheng Sun, University of Maryland, College Park
Wanghao Ye, University of Maryland, College Park
Ang Li, University of Maryland, College Park

Topics: Machine Learning System, Large Language Model, Deep Learning, Natural Language Processing, Mobile Computing