ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the sensitivity of LoRA fine-tuning to hyperparameters and the inefficiency of running many concurrent tuning jobs independently, which wastes computation and leaves GPUs underutilized. The authors propose a cooperative training system that executes multiple LoRA tasks concurrently on a shared frozen backbone. By monitoring loss trajectories, the system terminates underperforming configurations early. It further combines fused grouped GEMM operations with rank-local adapter parallelism to co-locate surviving adapters, and it jointly schedules work both across and within tasks. This approach is the first to exploit synergistic optimization opportunities among multiple LoRA tasks, achieving up to a 13.8× speedup over state-of-the-art methods without compromising adapter quality, thereby substantially improving cluster resource efficiency.
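To make the early-termination idea concrete, below is a minimal sketch of loss-trajectory-based pruning of LoRA hyperparameter configurations. The specific rule (smoothed loss over a recent window, a relative margin against the current best, and a patience counter) is an illustrative assumption; the paper only states that loss trajectories are monitored to terminate unpromising configurations.

```python
# Minimal sketch (not ALTO's actual criterion): terminate LoRA configurations
# whose smoothed loss trails the best concurrent configuration by a relative
# margin for several consecutive checks. Window, margin, and patience values
# are illustrative assumptions.
from collections import defaultdict


class EarlyTerminator:
    def __init__(self, window=50, rel_margin=0.05, patience=3):
        self.window = window          # steps used for the smoothed loss
        self.rel_margin = rel_margin  # allowed relative gap to the best config
        self.patience = patience      # consecutive failing checks before stopping
        self.losses = defaultdict(list)
        self.strikes = defaultdict(int)
        self.stopped = set()

    def record(self, config_id, loss):
        if config_id not in self.stopped:
            self.losses[config_id].append(loss)

    def _smoothed(self, config_id):
        hist = self.losses[config_id][-self.window:]
        return sum(hist) / len(hist)

    def check(self):
        """Return configs to terminate; their GPU share can then be
        reclaimed for the surviving adapters."""
        active = [c for c in self.losses
                  if c not in self.stopped and len(self.losses[c]) >= self.window]
        if not active:
            return []
        best = min(self._smoothed(c) for c in active)
        newly_stopped = []
        for c in active:
            if self._smoothed(c) > best * (1.0 + self.rel_margin):
                self.strikes[c] += 1
            else:
                self.strikes[c] = 0
            if self.strikes[c] >= self.patience:
                self.stopped.add(c)
                newly_stopped.append(c)
        return newly_stopped
```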
📝 Abstract
Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.
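To illustrate the co-location idea from the abstract, below is a minimal PyTorch sketch of several same-rank LoRA adapters sharing one frozen linear layer: the frozen weight is applied once to each adapter's token batch, while the low-rank updates are computed as a batched matmul over stacked A/B matrices. This stands in for ALTO's fused grouped GEMM and rank-local adapter parallelism, which the paper does not specify at this level; all names and shapes here are illustrative assumptions.

```python
# Minimal PyTorch sketch of multiple LoRA adapters over a shared frozen
# backbone layer. torch.bmm over stacked adapter matrices stands in for a
# fused grouped-GEMM kernel; shapes and names are illustrative assumptions.
import torch
import torch.nn as nn


class SharedBackboneLoRA(nn.Module):
    def __init__(self, d_in, d_out, num_adapters, rank, alpha=16.0):
        super().__init__()
        # Frozen backbone weight, shared by every concurrent tuning job.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # One (A, B) pair per adapter, stacked so a single batched matmul
        # covers all adapters at once.
        self.A = nn.Parameter(torch.randn(num_adapters, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_adapters, rank, d_out))
        self.scale = alpha / rank

    def forward(self, x):
        # x: [num_adapters, tokens, d_in] -- one token batch per adapter.
        base_out = self.base(x)                             # shared frozen GEMM
        lora_out = torch.bmm(torch.bmm(x, self.A), self.B)  # batched low-rank update
        return base_out + self.scale * lora_out


layer = SharedBackboneLoRA(d_in=1024, d_out=1024, num_adapters=4, rank=8)
tokens = torch.randn(4, 16, 1024)
print(layer(tokens).shape)  # torch.Size([4, 16, 1024])
```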
Problem

Research questions and friction points this paper is trying to address.

LoRA
hyperparameter tuning
heterogeneous workloads
GPU utilization
multi-tenant
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA
hyperparameter tuning
GPU resource orchestration
early termination
adapter parallelism
Jingwei Zuo
Rice University
Xinze Feng
Rice University
Zien Liu
Rice University
Kaijian Wang
Rice University
Fanjiang Ye
Rice University
Ye Cao
Independent Researcher
Zhuang Wang
Rice University
Yuke Wang
Assistant Professor, Rice University
Systems for Machine Learning, High-performance Computing