🤖 AI Summary
To address high cold-start latency and severe GPU memory fragmentation when serving many LoRA fine-tuned LLMs in serverless environments, this paper proposes Predictive-LoRA (P-LoRA). First, it introduces a lightweight LSTM-based traffic predictor that forecasts adapter demand, enabling proactive prefetching of hot adapters from host memory to the GPU and mitigating cold starts. Second, it introduces a page-based adapter memory management mechanism, inspired by OS virtual memory, that supports fine-grained, on-demand adapter loading and memory reuse across heterogeneous adapter ranks. By tightly integrating dynamic LoRA scheduling with serverless function orchestration, P-LoRA achieves, under production-like workloads: up to a 68% reduction in cold-start latency; sustained GPU memory utilization of at least 87%; 1.52× higher throughput than S-LoRA; and a 35% decrease in average time-to-first-token (TTFT) under high concurrency. These improvements substantially enhance resource efficiency and responsiveness for multi-tenant LLM serving.
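The page-based idea can be sketched as a pool of fixed-size pages shared by adapters of different ranks, so freed pages are immediately reusable and fragmentation cannot accumulate. This is a minimal illustrative sketch, not the paper's implementation: `AdapterPool`, `PAGE_ELEMS`, and the rank-to-page-count arithmetic are all assumptions.

```python
# Hypothetical sketch of a page-based adapter memory pool.
# PAGE_ELEMS, AdapterPool, and the size math are illustrative assumptions,
# not P-LoRA's actual implementation.

PAGE_ELEMS = 4096  # elements per fixed-size page (assumed)

class AdapterPool:
    """Places LoRA adapters onto fixed-size pages so adapters of
    heterogeneous ranks can share GPU memory without fragmentation."""

    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}  # adapter_id -> list of page indices

    def pages_needed(self, rank, hidden_dim=4096):
        # A rank-r adapter stores two matrices: A (hidden_dim x r) and
        # B (r x hidden_dim), i.e. 2 * r * hidden_dim elements.
        elems = 2 * rank * hidden_dim
        return -(-elems // PAGE_ELEMS)  # ceiling division

    def load(self, adapter_id, rank):
        n = self.pages_needed(rank)
        if len(self.free_pages) < n:
            raise MemoryError("no free pages; evict a cold adapter first")
        self.page_table[adapter_id] = [self.free_pages.pop() for _ in range(n)]
        return self.page_table[adapter_id]

    def evict(self, adapter_id):
        # Freed pages go back to one shared free list, reusable by any rank.
        self.free_pages.extend(self.page_table.pop(adapter_id))

pool = AdapterPool(total_pages=1024)
pool.load("tenant-a", rank=16)   # small adapter: 32 pages
pool.load("tenant-b", rank=64)   # larger adapter, same shared pool
pool.evict("tenant-a")           # its pages return to the free list
```

Because every allocation is an integer number of identical pages, eviction never strands oddly-sized holes; this is the same reasoning that lets OS virtual memory sustain high utilization.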
📝 Abstract
The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold-start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold-start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory, which keeps GPU memory utilization above 87% even under heterogeneous adapter ranks. We evaluate P-LoRA using production-like workloads derived from the Azure Functions trace. Experimental results demonstrate that P-LoRA achieves 1.52× higher throughput than S-LoRA while reducing the average Time-To-First-Token (TTFT) by 35% under high-concurrency scenarios.
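The proactive-prefetch loop described above amounts to forecasting per-adapter request rates and keeping the top-k predicted-hot adapters resident on the GPU. The sketch below illustrates that control loop only; the paper uses a lightweight LSTM as the forecaster, whereas here a simple exponential moving average stands in for it, and all names (`DemandForecaster`, `hot_adapters`) are assumptions.

```python
# Illustrative sketch of proactive adapter prefetching.
# P-LoRA's predictor is an LSTM; an exponential moving average (EMA)
# stands in here purely to show the surrounding prefetch logic.

class DemandForecaster:
    """Per-adapter request-rate forecast (EMA placeholder for the LSTM)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # smoothing factor: weight of the newest interval
        self.rate = {}       # adapter_id -> smoothed requests per interval

    def observe(self, counts):
        # Fold one interval's observed request counts into the forecast.
        for adapter_id, c in counts.items():
            prev = self.rate.get(adapter_id, 0.0)
            self.rate[adapter_id] = self.alpha * c + (1 - self.alpha) * prev

    def hot_adapters(self, k):
        # Top-k adapters by forecast demand: candidates to prefetch from
        # host memory to GPU before requests arrive (avoiding cold starts).
        return sorted(self.rate, key=self.rate.get, reverse=True)[:k]

f = DemandForecaster()
f.observe({"tenant-a": 10, "tenant-b": 2})
f.observe({"tenant-a": 12, "tenant-b": 1, "tenant-c": 30})
print(f.hot_adapters(k=2))  # tenant-c spikes, tenant-a stays warm
```

The scheduler would call `hot_adapters` each interval and issue host-to-GPU copies for any listed adapter not already resident, which is how prefetching converts reactive cold-start loads into background transfers.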