Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LoRA fine-tuning and inference systems operate in isolation, leading to resource redundancy and inefficient scheduling. This paper proposes Loquetier, a unified framework for LoRA-based large language model fine-tuning and serving. It virtualizes multiple adapter modules to isolate parameter updates, enabling concurrent execution of multiple LoRAs atop a shared base model. By merging the fine-tuning and inference forward paths, Loquetier enables efficient batching and eliminates redundant kernel invocations, supported by a fine-grained low-rank computation flow and optimized kernels. Experiments show that Loquetier achieves up to 3.0× the throughput of the state-of-the-art co-serving system on inference-only tasks and 46.4× higher SLO attainment than PEFT on unified fine-tuning and inference tasks, significantly improving system efficiency and resource utilization.
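The core idea above can be sketched as follows. This is a minimal NumPy illustration of multi-LoRA virtualization, not Loquetier's actual implementation: the names (`multi_lora_forward`, `adapters`) and the routing scheme are assumptions for illustration. The point is that one frozen base weight serves the whole batch, while each request adds only its own low-rank delta.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2

# One shared, frozen base weight for all requests.
W_base = rng.standard_normal((d_in, d_out))

# Two independent adapters "virtualized" atop the same base model.
# Standard LoRA init: A small random, B zero, so each adapter starts
# as an identity delta.
adapters = {
    name: {"A": rng.standard_normal((d_in, rank)) * 0.01,
           "B": np.zeros((rank, d_out))}
    for name in ("adapter_0", "adapter_1")
}

def multi_lora_forward(x, adapter_ids):
    """One base matmul for the whole batch, plus a per-request low-rank path."""
    y = x @ W_base                      # shared computation, done once
    for i, name in enumerate(adapter_ids):
        a = adapters[name]
        y[i] += x[i] @ a["A"] @ a["B"]  # low-rank delta, isolated per adapter
    return y

# A mixed batch where different rows belong to different adapters.
x = rng.standard_normal((3, d_in))
y = multi_lora_forward(x, ["adapter_0", "adapter_1", "adapter_0"])
print(y.shape)  # (3, 8)
```

Because each adapter's update touches only its own `A`/`B` factors, parameter changes stay isolated while the expensive base-model computation is shared, which is the resource-sharing argument the summary makes.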

📝 Abstract
Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. While prior work has explored strategies for integrating LLM training and serving, there still remains a gap in unifying fine-tuning and inference for LoRA-based models. We present Loquetier, a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and serving within a single runtime. Loquetier introduces two key components: (1) a Virtualized Module that isolates PEFT-based modifications and supports multiple adapters on a shared base model, and (2) an optimized computation flow with a kernel design that merges fine-tuning and inference paths in forward propagation, enabling efficient batching and minimizing kernel invocation overhead. Extensive experiments across three task settings show that Loquetier consistently outperforms existing baselines in both performance and flexibility, achieving up to $3.0\times$ the throughput of the state-of-the-art co-serving system on inference-only tasks and $46.4\times$ higher SLO attainment than PEFT on unified fine-tuning and inference tasks. The implementation of Loquetier is publicly available at https://github.com/NJUDeepEngine/Loquetier.
Problem

Research questions and friction points this paper is trying to address.

Unifying LoRA fine-tuning and inference in one runtime
Supporting multiple adapters on shared base models
Optimizing computation flow for efficient batching and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtualized Module isolates PEFT modifications and supports multiple adapters
Optimized kernel design merges fine-tuning and inference paths
Single runtime integrates LoRA fine-tuning and serving efficiently
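The merged fine-tuning/inference path named in the bullets above can be sketched as a single batched forward over mixed requests, where only the fine-tuning rows propagate gradients to their adapter. This is a hypothetical NumPy illustration under a simple MSE loss, not Loquetier's kernel design; `is_train` and the manual gradient update are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 4, 2
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trainable LoRA factors
B = np.zeros((r, d))                     # standard LoRA init: B starts at zero

x = rng.standard_normal((3, d))
is_train = np.array([True, False, True])  # mixed batch: rows 0 and 2 fine-tune
target = rng.standard_normal((3, d))

# Single forward pass serves both the training and the inference requests.
y = x @ W + x @ A @ B

# Backward only over the fine-tuning rows; inference rows contribute no
# gradient, and the base weight W is never updated.
err = np.where(is_train[:, None], y - target, 0.0)
grad_B = (x @ A).T @ err
grad_A = x.T @ (err @ B.T)
B -= 0.1 * grad_B
A -= 0.1 * grad_A
print(bool(B.any()))  # True: B is no longer all-zero after one update
```

Masking the error instead of splitting the batch is what lets both request types share one kernel launch, which is the batching benefit the bullets describe.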
Yuchen Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Hanyue Du
State Key Laboratory for Novel Software Technology, Nanjing University, China
Chun Cao
Nanjing University
Jingwei Xu
State Key Laboratory for Novel Software Technology, Nanjing University, China