MinT: Managed Infrastructure for Training and Serving Millions of LLMs

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the storage and serving overhead of fine-tuning large language models into very large numbers of policies by introducing the MindLab Toolkit (MinT), a managed infrastructure system that keeps a shared base model resident and moves only lightweight LoRA adapter revisions, managing their full lifecycle through a single service interface. Key contributions include addressable catalogs of up to one million LoRA policies, exported adapters that can be under 1% of the base-model size in rank-1 settings, separation of durable policy addressability from CPU/GPU working sets, and support for tensor-parallel deployment, concurrent multi-policy GRPO training, packed MoE LoRA tensor loading, and cold loading treated as scheduled service work. Reported results show that adapter-only handoff reduces the measured step by 18.3× on a 4B dense model and 2.85× on a 30B MoE model; single-engine sweeps are measured through 100,000 adapters, clusters support thousand-adapter active waves, and packed MoE LoRA tensors improve live engine loading by 8.5–8.7×.
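The separation between a large addressable catalog and a small resident working set can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class name `AdapterWorkingSet`, the least-recently-used eviction rule, and the placeholder load step are hypothetical and are not taken from MinT, which describes cold loading as scheduled service work rather than as a simple cache.

```python
from collections import OrderedDict


class AdapterWorkingSet:
    """Toy model: a durable catalog of adapter ids plus a bounded resident set.

    Hypothetical illustration only; LRU eviction is an assumption,
    not MinT's documented scheduling policy.
    """

    def __init__(self, catalog, capacity):
        self.catalog = catalog          # adapter_id -> storage location (large, durable)
        self.capacity = capacity        # how many adapters fit in the working set
        self.resident = OrderedDict()   # adapter_id -> loaded weights, in LRU order
        self.cold_loads = 0             # number of loads that missed the working set

    def acquire(self, adapter_id):
        if adapter_id not in self.catalog:
            raise KeyError(adapter_id)
        if adapter_id in self.resident:
            # Warm hit: mark as most recently used.
            self.resident.move_to_end(adapter_id)
            return self.resident[adapter_id]
        # Cold load: stand-in for reading adapter tensors from storage.
        self.cold_loads += 1
        weights = f"loaded:{self.catalog[adapter_id]}"
        self.resident[adapter_id] = weights
        if len(self.resident) > self.capacity:
            # Evict the least recently used adapter; it stays addressable.
            self.resident.popitem(last=False)
        return weights


catalog = {i: f"adapters/{i}" for i in range(1000)}
working_set = AdapterWorkingSet(catalog, capacity=2)
for adapter_id in (1, 2, 1, 3):
    working_set.acquire(adapter_id)
print(list(working_set.resident), working_set.cold_loads)
```

An evicted adapter remains in the catalog and can be requested again, at the cost of another cold load. That is the sense in which addressability is decoupled from residency in this sketch.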
📝 Abstract
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
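To see why an exported adapter can be a small fraction of the base model, a short worked calculation helps. For a single weight matrix of shape d_out × d_in, a rank-r LoRA pair stores r × (d_in + d_out) parameters. The sketch below applies that standard count to a hypothetical 4096 × 4096 projection; the dimensions and ranks are illustrative assumptions, not figures from the paper, and a real model's ratio depends on which layers carry adapters.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameters as a fraction of the full (d_out x d_in) weight matrix."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical square projection with hidden size 4096 (illustrative only).
for rank in (1, 8, 64):
    print(f"rank={rank}: {adapter_fraction(4096, 4096, rank):.4%} of the matrix")
```

At rank 1 the pair holds 8,192 parameters against roughly 16.8 million in the full matrix, which is consistent with the abstract's statement that rank-1 adapters can fall under 1% of base-model size.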
Problem

Research questions and friction points this paper is trying to address.

LoRA
large language models
model serving
scalable infrastructure
parameter-efficient fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA
managed infrastructure
model serving
MoE
parameter-efficient fine-tuning
🔎 Similar Papers
2024-08-10, AAAI Conference on Artificial Intelligence, Citations: 30
Song Cao
University of Southern California
Computer Vision
Vic Cao
Mind Lab
Andrew Chen
Mind Lab
Kaijie Chen
Mind Lab
Cleon Cheng
Mind Lab
Steven Chiang
Mind Lab
Kaixuan Fan
MSc Student, Imperial College London
Virtual Reality, Human-Computer Interaction, Computer Vision
Hera Feng
Mind Lab
Huan Feng
Mind Lab
Arthur Fu
Mind Lab
Jun Gao
Mind Lab
Hongquan Gu
Mind Lab
Aaron Guan
Mind Lab
Nolan Ho
Mind Lab
Mutian Hong
Mind Lab
Hailee Hou
Mind Lab
Peixuan Hua
Mind Lab
Charles Huang
Mind Lab
Miles Jiang
Mind Lab
Nora Jiang
Mind Lab
Yuyi Jiang
Mind Lab
Qiuyu Jin
Mind Lab
Fancy Kong
Mind Lab