tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of synchronization stalls, excessive communication overhead, and low resource utilization that arise when heterogeneous adapters are trained concurrently in multi-task LoRA training. To overcome these issues, the authors propose an elastic shared super-model architecture that enables efficient batched training by fusing multiple LoRA adapters atop a shared frozen base model. Key innovations include a fused LoRA kernel that adaptively reconstructs low-rank computation tiles, a residual-capacity-aware online scheduler, and a rank-aware nano-batch scheduling strategy, all integrated with distributed parallel training to optimize resource allocation. Experimental results demonstrate that the proposed method improves training throughput by 1.2–1.8×, reduces job completion time by 2.3–5.4×, and increases GPU utilization by 37%, offering an effective solution for co-training heterogeneous LoRA tasks.
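The core idea of fusing adapters over one frozen backbone can be illustrated with a minimal sketch. This is not the paper's actual fused kernel; all names, shapes, and the three hypothetical jobs (with heterogeneous ranks 2, 4, and 8) are illustrative assumptions. The point it shows: the expensive base GEMM is computed once for the whole fused batch, and each job only pays for its own cheap rank-r delta.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 16

# Shared frozen base weight: one GEMM serves every co-located job.
W = rng.standard_normal((d_in, d_out))

# Three hypothetical jobs with heterogeneous adapter ranks (r = 2, 4, 8).
adapters = {
    job: (rng.standard_normal((d_in, r)) * 0.01,   # A: d_in x r
          rng.standard_normal((r, d_out)) * 0.01)  # B: r x d_out
    for job, r in [("job0", 2), ("job1", 4), ("job2", 8)]
}

# Fused batch: each row is tagged with the job (adapter) it belongs to.
x = rng.standard_normal((6, d_in))
owner = ["job0", "job0", "job1", "job2", "job2", "job2"]

# One shared base GEMM for all jobs, then per-adapter low-rank deltas.
out = x @ W
for job, (A, B) in adapters.items():
    rows = [i for i, o in enumerate(owner) if o == job]
    out[rows] += x[rows] @ A @ B  # rank-r update: O(len(rows) * r * d)

# Sanity check: the fused result matches running each job independently.
for job, (A, B) in adapters.items():
    rows = [i for i, o in enumerate(owner) if o == job]
    assert np.allclose(out[rows], x[rows] @ (W + A @ B))
```

Because the per-adapter updates differ in rank, a naive loop like this serializes unevenly sized small GEMMs; the paper's rank-aware nano-batching and tile reconstruction are aimed precisely at keeping such heterogeneous low-rank work dense on the GPU.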

📝 Abstract
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2–1.8×, job training completion time by 2.3–5.4×, and GPU utilization by 37%.
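The abstract's "residual-capacity-aware scheduler that adaptively groups jobs" can be sketched as a simple greedy online packer. This is an assumption-laden toy, not tLoRA's scheduler: the abstract `demand` value (standing in for some function of adapter rank and batch size), the single capacity number, and the greedy policy are all invented for illustration.

```python
import heapq

def schedule(jobs, num_gpus, capacity):
    """Greedy residual-capacity-aware grouping (illustrative sketch).

    jobs: list of (job_id, demand) pairs, where demand is an abstract
    resource cost. Each arriving job is placed on the GPU group with
    the most residual capacity that can still fit it; jobs that fit
    nowhere are deferred and returned in `pending`.
    """
    # Max-heap over residual capacity, stored as (negative residual, gpu_id).
    heap = [(-capacity, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    placement, pending = {}, []
    for job_id, demand in jobs:
        neg_res, gpu = heap[0]          # GPU with the most free capacity
        if -neg_res >= demand:
            # Consume capacity and push the GPU back with its new residual.
            heapq.heapreplace(heap, (neg_res + demand, gpu))
            placement[job_id] = gpu
        else:
            pending.append(job_id)       # defer: no group can absorb it
    return placement, pending

# Example: four jobs onto two GPUs with capacity 8 each.
placement, pending = schedule(
    [("a", 3), ("b", 5), ("c", 4), ("d", 6)], num_gpus=2, capacity=8)
```

In this toy run, jobs "a" and "c" share GPU 0, "b" lands on GPU 1, and "d" is deferred because neither group has 6 units of residual capacity left. An online variant would re-run this decision as jobs arrive and complete, which is the regime the paper's scheduler targets.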
Problem

Research questions and friction points this paper is trying to address.

LoRA
multi-LoRA training
heterogeneous adapters
training co-location
resource sharing
Innovation

Methods, ideas, or system contributions that make the work stand out.

tLoRA
elastic shared super-model
fused LoRA kernel
rank-aware nano-batching
residual-capacity-aware scheduling
Kevin Li
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Champaign, Illinois, United States
Dibyadeep Saha
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Champaign, Illinois, United States
Avni Kanodia
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Champaign, Illinois, United States
Fan Lai
University of Illinois Urbana-Champaign
Machine Learning Systems · Cloud Computing · Machine Learning