Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high cold-start latency and severe GPU memory fragmentation when serving many LoRA-fine-tuned LLMs in serverless environments, this paper proposes Predictive-LoRA (P-LoRA). First, it introduces a lightweight LSTM-based traffic predictor that enables proactive adapter preloading, mitigating cold starts. Second, it introduces a page-based adapter memory management mechanism, inspired by OS virtual memory, that supports fine-grained, on-demand adapter loading and memory reuse. By tightly integrating dynamic LoRA scheduling with serverless function orchestration, P-LoRA achieves, under realistic workloads: a 68% reduction in cold-start latency, sustained GPU memory utilization of at least 87%, 1.52× higher throughput than S-LoRA, and a 35% decrease in time-to-first-token (TTFT) under high concurrency. These improvements substantially enhance resource efficiency and responsiveness for multi-tenant LLM serving.
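The proactive-prefetching idea described above can be sketched as follows. The paper uses an LSTM over request traces; the toy sketch below substitutes an exponential moving average as the demand forecast, since the decision logic (preload the top-k hottest adapters before requests arrive) is the same. The class and method names (`PrefetchPlanner`, `observe`, `plan`) are illustrative, not from the paper:

```python
class PrefetchPlanner:
    """Toy sketch of proactive adapter prefetching.

    A real system would forecast per-adapter demand with an LSTM;
    here we approximate it with an exponential moving average (EMA)
    over per-window request counts (illustrative only).
    """

    def __init__(self, gpu_slots: int, alpha: float = 0.5):
        self.gpu_slots = gpu_slots           # adapters that fit on the GPU
        self.alpha = alpha                   # EMA smoothing factor
        self.forecast: dict[str, float] = {} # adapter -> predicted demand

    def observe(self, window_counts: dict[str, int]) -> None:
        """Fold the latest time window's request counts into the forecast."""
        for adapter, count in window_counts.items():
            prev = self.forecast.get(adapter, 0.0)
            self.forecast[adapter] = self.alpha * count + (1 - self.alpha) * prev

    def plan(self) -> list[str]:
        """Return the adapters to keep resident on (or prefetch to) the GPU."""
        ranked = sorted(self.forecast, key=self.forecast.get, reverse=True)
        return ranked[:self.gpu_slots]
```

For example, after observing a window with counts `{"a": 10, "b": 1, "c": 5}`, a planner with two GPU slots would preload adapters `a` and `c`; the scheduler then serves their requests warm instead of loading adapters reactively on a miss.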

📝 Abstract
The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory, which keeps GPU memory utilization above 87% even under heterogeneous adapter ranks. We evaluate P-LoRA using production-like workloads derived from the Azure Functions trace. Experimental results demonstrate that P-LoRA achieves 1.52x higher throughput than S-LoRA while reducing the average Time-To-First-Token (TTFT) by 35% under high concurrency scenarios.
Problem

Research questions and friction points this paper is trying to address.

Reduces cold start latency in serverless LLM inference
Mitigates GPU memory fragmentation from adapter swapping
Improves throughput and response time for LoRA-based models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LSTM-based traffic predictor for proactive adapter prefetching
Page-based memory management inspired by virtual memory
Reduces cold start latency and GPU memory fragmentation
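The page-based memory idea in the bullets above can be sketched as a toy allocator: GPU adapter memory is carved into fixed-size pages, an adapter of rank r takes ceil(r / ranks_per_page) pages, and freed pages return to a shared pool that any adapter can reuse regardless of rank, which is what avoids external fragmentation under heterogeneous ranks. All names here (`PagedAdapterPool`, `ranks_per_page`) are illustrative assumptions, not the paper's API:

```python
class PagedAdapterPool:
    """Toy sketch of page-based adapter memory management (illustrative)."""

    def __init__(self, total_pages: int, ranks_per_page: int):
        self.ranks_per_page = ranks_per_page
        self.free_pages = list(range(total_pages))  # shared free-page pool
        self.page_table: dict[str, list[int]] = {}  # adapter -> its pages

    def load(self, adapter: str, rank: int) -> bool:
        """Allocate pages for an adapter; returns False if the pool is exhausted."""
        needed = -(-rank // self.ranks_per_page)  # ceiling division
        if len(self.free_pages) < needed:
            return False
        self.page_table[adapter] = [self.free_pages.pop() for _ in range(needed)]
        return True

    def evict(self, adapter: str) -> None:
        """Return an adapter's pages to the free pool for reuse by any rank."""
        self.free_pages.extend(self.page_table.pop(adapter, []))
```

With `ranks_per_page=16` and 4 pages, a rank-32 adapter takes 2 pages and a rank-16 adapter takes 1; evicting the rank-32 adapter frees both of its pages for adapters of any other rank, so no memory is stranded between differently sized allocations.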
Yinan Ni
University of Illinois at Urbana–Champaign, Urbana, IL, USA
Xiao Yang
Santa Clara University, Santa Clara, CA, USA
Yuqi Tang
Duke University
Zhimin Qiu
University of Southern California, Los Angeles, CA, USA
Chen Wang
University of Missouri–Kansas City, Kansas City, MO, USA
Tingzhou Yuan
Boston University, Boston, MA, USA