🤖 AI Summary
To address high cold-start latency and severe GPU memory fragmentation when serving many LoRA fine-tuned LLMs in serverless environments, this paper proposes Predictive-LoRA (P-LoRA). First, it introduces a lightweight LSTM-based traffic predictor that forecasts adapter demand, enabling proactive prefetching of hot adapters from host memory to the GPU and mitigating cold starts. Second, it introduces a page-based adapter memory management mechanism, inspired by OS virtual memory, that supports fine-grained, on-demand adapter loading and memory reuse across heterogeneous adapter ranks. By tightly integrating dynamic LoRA scheduling with serverless function orchestration, P-LoRA achieves, under production-like workloads: up to a 68% reduction in cold-start latency; sustained GPU memory utilization of at least 87%; 1.52× higher throughput than S-LoRA; and a 35% decrease in average time-to-first-token (TTFT) under high concurrency. These improvements substantially enhance resource efficiency and responsiveness for multi-tenant LLM serving.
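The page-based idea can be sketched as a pool of fixed-size pages shared by adapters of different ranks, so freed pages are immediately reusable and fragmentation cannot accumulate. This is a minimal illustrative sketch, not the paper's implementation: `AdapterPool`, `PAGE_ELEMS`, and the rank-to-page-count arithmetic are all assumptions.

```python
# Hypothetical sketch of a page-based adapter memory pool.
# PAGE_ELEMS, AdapterPool, and the size math are illustrative assumptions,
# not P-LoRA's actual implementation.

PAGE_ELEMS = 4096  # elements per fixed-size page (assumed)

class AdapterPool:
    """Places LoRA adapters onto fixed-size pages so adapters of
    heterogeneous ranks can share GPU memory without fragmentation."""

    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}  # adapter_id -> list of page indices

    def pages_needed(self, rank, hidden_dim=4096):
        # A rank-r adapter stores two matrices: A (hidden_dim x r) and
        # B (r x hidden_dim), i.e. 2 * r * hidden_dim elements.
        elems = 2 * rank * hidden_dim
        return -(-elems // PAGE_ELEMS)  # ceiling division

    def load(self, adapter_id, rank):
        n = self.pages_needed(rank)
        if len(self.free_pages) < n:
            raise MemoryError("no free pages; evict a cold adapter first")
        self.page_table[adapter_id] = [self.free_pages.pop() for _ in range(n)]
        return self.page_table[adapter_id]

    def evict(self, adapter_id):
        # Freed pages go back to one shared free list, reusable by any rank.
        self.free_pages.extend(self.page_table.pop(adapter_id))

pool = AdapterPool(total_pages=1024)
pool.load("tenant-a", rank=16)   # small adapter: 32 pages
pool.load("tenant-b", rank=64)   # larger adapter, same shared pool
pool.evict("tenant-a")           # its pages return to the free list
```

Because every allocation is an integer number of identical pages, eviction never strands oddly-sized holes; this is the same reasoning that lets OS virtual memory sustain high utilization.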
📝 Abstract
The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold-start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold-start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory, which keeps GPU memory utilization above 87% even under heterogeneous adapter ranks. We evaluate P-LoRA using production-like workloads derived from the Azure Functions trace. Experimental results demonstrate that P-LoRA achieves 1.52× higher throughput than S-LoRA while reducing the average Time-To-First-Token (TTFT) by 35% under high-concurrency scenarios.
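The proactive-prefetch loop described above amounts to forecasting per-adapter request rates and keeping the top-k predicted-hot adapters resident on the GPU. The sketch below illustrates that control loop only; the paper uses a lightweight LSTM as the forecaster, whereas here a simple exponential moving average stands in for it, and all names (`DemandForecaster`, `hot_adapters`) are assumptions.

```python
# Illustrative sketch of proactive adapter prefetching.
# P-LoRA's predictor is an LSTM; an exponential moving average (EMA)
# stands in here purely to show the surrounding prefetch logic.

class DemandForecaster:
    """Per-adapter request-rate forecast (EMA placeholder for the LSTM)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # smoothing factor: weight of the newest interval
        self.rate = {}       # adapter_id -> smoothed requests per interval

    def observe(self, counts):
        # Fold one interval's observed request counts into the forecast.
        for adapter_id, c in counts.items():
            prev = self.rate.get(adapter_id, 0.0)
            self.rate[adapter_id] = self.alpha * c + (1 - self.alpha) * prev

    def hot_adapters(self, k):
        # Top-k adapters by forecast demand: candidates to prefetch from
        # host memory to GPU before requests arrive (avoiding cold starts).
        return sorted(self.rate, key=self.rate.get, reverse=True)[:k]

f = DemandForecaster()
f.observe({"tenant-a": 10, "tenant-b": 2})
f.observe({"tenant-a": 12, "tenant-b": 1, "tenant-c": 30})
print(f.hot_adapters(k=2))  # tenant-c spikes, tenant-a stays warm
```

The scheduler would call `hot_adapters` each interval and issue host-to-GPU copies for any listed adapter not already resident, which is how prefetching converts reactive cold-start loads into background transfers.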