LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the substantial memory and computational overhead incurred by loading multiple LoRA adapters in parallel, this paper proposes a LoRA-specific mixed-precision quantization method. The approach first reparameterizes the LoRA matrices via singular value decomposition (SVD) to concentrate representational information into the leading singular vectors and values. It then applies a channel-wise mixed-precision quantization scheme: dominant singular vectors and large singular values are preserved at higher precision (4–8 bits), while less informative components are aggressively compressed to 1–2 bits. Evaluated on LLaMA 2 (7B/13B) and Mistral 7B, the method achieves significantly lower average bit-widths than state-of-the-art quantization baselines. It maintains or even improves performance on mathematical reasoning, code generation, and summarization tasks, while reducing GPU memory consumption by over 40%. The proposed technique thus offers an effective trade-off between efficiency and task robustness in multi-LoRA inference.
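As a rough sketch of the reparameterization step described above (not the paper's actual implementation; all shapes and variable names here are illustrative), the key observation is that a LoRA update W = B·A has rank at most r, so an SVD of the product recovers an equivalent rank-r factorization whose rows and columns are ordered by singular value, i.e., by importance:

```python
import numpy as np

# Hypothetical toy shapes: a LoRA update W = B @ A with rank r.
d_out, d_in, r = 64, 64, 8
rng = np.random.default_rng(0)
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

# Reparameterize the low-rank update via SVD. Since B @ A has rank at
# most r, only the top-r singular values are nonzero, and the update
# can be rewritten exactly with r components.
U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Fold singular values into the left factor; columns of B_new and rows
# of A_new are now sorted by importance (descending singular value).
B_new = U * s          # shape (d_out, r)
A_new = Vt             # shape (r, d_in)

# The reparameterization is exact up to floating-point error.
assert np.allclose(B_new @ A_new, B @ A)
```

With the components sorted this way, the quantizer can treat the first few columns/rows differently from the rest, which is what enables the mixed-precision scheme.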

📝 Abstract
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
Problem

Research questions and friction points this paper is trying to address.

Reducing aggregate memory cost of multiple LoRA adapters
Enabling ultra-low bit quantization for LoRA components
Maintaining model performance while significantly lowering bitwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision post-training quantization for LoRA
Reparameterizes adapters via singular value decomposition (SVD)
Quantizes important components to higher precision
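The bullets above can be sketched with a simple simulated-quantization routine (again illustrative, not the paper's code): after SVD reparameterization, the top-k components are kept at a higher bit-width and the remainder at an ultra-low one.

```python
import numpy as np

def quantize_rows(M, bits):
    """Symmetric per-row uniform quantization to the given bit-width
    (returns the dequantized values, i.e., simulated quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(M).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.clip(np.round(M / scale), -qmax - 1, qmax)
    return q * scale

# Hypothetical split: after SVD, rows of A_new are sorted by singular
# value, so keep the top-k components at 4 bits and the rest at 2.
rng = np.random.default_rng(1)
r, d_in, k = 8, 64, 2
# Decaying row magnitudes mimic descending singular values.
A_new = rng.standard_normal((r, d_in)) * np.linspace(2.0, 0.1, r)[:, None]

A_q = np.vstack([
    quantize_rows(A_new[:k], bits=4),   # dominant components: higher precision
    quantize_rows(A_new[k:], bits=2),   # remaining components: ultra-low bits
])

# Average bit-width across components: (2*4 + 6*2) / 8 = 2.5 bits.
avg_bits = (k * 4 + (r - k) * 2) / r
```

Because the low-magnitude components contribute little to the reconstructed update, the quantization error they incur at 2 bits is small in absolute terms, which is the intuition behind the mixed-precision allocation.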
Amir Reza Mirzaei
Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta
Yuqiao Wen
Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta
Yanshuai Cao
Borealis AI
Artificial Intelligence · Machine Learning · Generative Models · Natural Language Processing · Computer Vision
Lili Mou
University of Alberta
Natural Language Processing · Machine Learning