🤖 AI Summary
Existing coupled LoRA serving systems face significant challenges under emerging architectures such as Mixture-of-Experts (MoE), including prohibitive memory overhead, poor scalability, and inflated tail latency. This work proposes InfiniLoRA, the first distributed serving system that decouples LoRA execution from base-model inference, leveraging dedicated LoRA servers to enable efficient multi-tenant deployment. Its core innovations include a parallelism-aware execution mechanism, SLO-driven dynamic resource allocation, critical-path optimization, GPU-initiated communication, and customized LoRA kernels. Experimental results demonstrate that, under stringent latency SLO constraints, InfiniLoRA achieves a 3.05× improvement in average request throughput and raises the proportion of LoRA adapters meeting their SLOs by 54.0%.
📝 Abstract
LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0\%.
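What makes this disaggregation possible is that a LoRA-adapted layer is purely additive: the output is the frozen base projection plus a low-rank delta, so the two terms can be computed on different servers and summed afterward. Below is a minimal numpy sketch of that decomposition; the `base_server`/`lora_server` functions are illustrative stand-ins, not InfiniLoRA's actual API, and real systems would batch requests and overlap the two computations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (toy values)

W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
x = rng.standard_normal(d)        # one token's activation

# Coupled serving: base matmul and adapter delta on the same GPU.
y_coupled = W @ x + B @ (A @ x)

# Disaggregated serving (simplified): the base server computes W @ x
# while a dedicated LoRA server computes the low-rank delta B @ (A @ x);
# the two partial results are summed when they meet.
def base_server(x):
    return W @ x

def lora_server(x):
    return B @ (A @ x)

y_disagg = base_server(x) + lora_server(x)
assert np.allclose(y_coupled, y_disagg)  # additive split is exact
```

Because the delta only needs the `r x d` and `d x r` adapter matrices, a shared LoRA server can hold many tenants' adapters without replicating the base model, which is the memory argument the abstract makes for MoE-scale models.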