🤖 AI Summary
This work addresses the challenges of synchronous blocking, excessive communication overhead, and low resource utilization that arise when heterogeneous LoRA adapters are trained concurrently as multiple tasks. To overcome these issues, the authors propose an elastic shared super-model architecture that enables efficient batching during training by fusing multiple LoRA adapters atop a shared frozen base model. Key innovations include an adaptive fused low-rank computation kernel, a residual-capacity-aware online scheduler, and a rank-aware nano-batch scheduling strategy, all integrated with distributed parallel training to optimize resource allocation. Experimental results demonstrate that the proposed method improves training throughput by 1.2–1.8×, reduces job completion time by 2.3–5.4×, and increases GPU utilization by 37%, offering an effective solution for co-training heterogeneous LoRA tasks.
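To make the fusion idea concrete, here is a minimal sketch of batching heterogeneous-rank LoRA adapters over one frozen base weight. All names, shapes, and the `fused_forward` helper are illustrative assumptions, not tLoRA's actual API; a real kernel would tile the low-rank updates by rank rather than loop over them in Python.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W = rng.standard_normal((d_in, d_out))          # shared frozen base weight

# Three jobs with different adapter ranks (the heterogeneity tLoRA targets).
ranks = [2, 4, 8]
adapters = [(rng.standard_normal((d_in, r)) * 0.01,   # A_i: down-projection
             np.zeros((r, d_out)))                    # B_i: zero-init (LoRA)
            for r in ranks]

def fused_forward(xs):
    """Compute each job's LoRA output while sharing the base matmul.

    xs: list of per-job inputs, each of shape (batch_i, d_in).
    """
    # One base matmul over the concatenated batch amortizes the shared work.
    x_cat = np.concatenate(xs, axis=0)
    base = x_cat @ W
    outs, off = [], 0
    for x, (A, B) in zip(xs, adapters):
        n = x.shape[0]
        # Per-adapter low-rank update added to the shared base output.
        outs.append(base[off:off + n] + (x @ A) @ B)
        off += n
    return outs

xs = [rng.standard_normal((3, d_in)) for _ in ranks]
ys = fused_forward(xs)
print([y.shape for y in ys])   # → [(3, 8), (3, 8), (3, 8)]
```

The point of the sketch is that only the small rank-dependent matmuls differ per job; the expensive base computation is shared, which is what makes co-locating many adapters on one backbone worthwhile.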
📝 Abstract
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns worse than running each job independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2–1.8×, reduces job completion time by 2.3–5.4×, and raises GPU utilization by 37%.
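The scheduling layer described above can be approximated by a simple best-fit policy over each worker's residual capacity. The sketch below is a guess at the general shape of such a policy, assuming one scalar demand per job (e.g. memory footprint); the `Worker` class and `place` function are hypothetical names, not part of tLoRA.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """A GPU/worker with a fixed capacity and its currently placed jobs."""
    capacity: float
    jobs: list = field(default_factory=list)

    @property
    def residual(self):
        # Capacity left after all jobs currently placed on this worker.
        return self.capacity - sum(demand for _, demand in self.jobs)

def place(workers, job_id, demand):
    """Online best-fit placement: choose the worker whose residual
    capacity is smallest but still sufficient, so large residual gaps
    stay available for future large jobs. Returns True on success."""
    candidates = [w for w in workers if w.residual >= demand]
    if not candidates:
        return False          # no worker can host this job right now
    best = min(candidates, key=lambda w: w.residual)
    best.jobs.append((job_id, demand))
    return True

workers = [Worker(10.0), Worker(6.0)]
place(workers, "job-a", 5.0)   # lands on the 6.0 worker (tighter fit)
place(workers, "job-b", 4.0)   # lands on the 10.0 worker
```

A residual-capacity-aware grouper in the paper's sense would also account for rank and communication cost when forming groups; this sketch only captures the capacity-tracking skeleton.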