ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

To address three critical bottlenecks—parameter redundancy, high cold-start latency, and GPU resource contention—in LoRA-finetuned large language model (LLM) inference within serverless environments, this paper proposes LoRA-Server: the first LoRA-specialized runtime system for serverless LLM inference. Its core innovations include a function-level isolation mechanism enabling secure shared access to the backbone model, a hierarchical preloading strategy for LoRA weights, and contention-aware dynamic batching coupled with memory-adaptive weight offloading. Evaluated under realistic industrial workloads, LoRA-Server reduces time-to-first-token (TTFT) by 86% and inference cost by 89%, while significantly mitigating GPU resource waste. The system delivers a production-ready, system-level solution for lightweight and efficient Model-as-a-Service (MaaS).

Technology Category

Application Category

📝 Abstract

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve general LLM but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that pre-loads comprehensive LoRA artifacts to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiment on industrial workloads demonstrates that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.

Problem

Research questions and friction points this paper is trying to address.

Reduces parameter redundancy in LoRA-based LLM serverless inference

Minimizes cold-start latency by pre-loading LoRA artifacts

Mitigates GPU resource contention during bursty workloads

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enables secure backbone LLM sharing

Pre-loads comprehensive LoRA artifacts

Employs contention-aware batching and offloading

🔎 Similar Papers

No similar papers found.