AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning

๐Ÿ“… 2026-02-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the suboptimal performance of conventional serial pipelines that apply quantization and low-rank adaptation (LoRA) fine-tuning separately, neglecting the coupling between bit-width and LoRA rank and yielding unstable performance under fixed memory budgets. To overcome this limitation, the authors propose AutoQRA, the first framework to jointly optimize per-layer bit-width and LoRA rank configurations during mixed-precision quantized fine-tuning. AutoQRA combines multi-fidelity evolutionary search with trust-region Bayesian optimization and incorporates layer-importance priors to reduce evaluation costs. Under strict memory constraints, AutoQRA significantly surpasses existing serial methods while keeping memory usage comparable to uniform 4-bit quantization, closely approaching full-precision fine-tuning results.

๐Ÿ“ Abstract
Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to exploit the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate into strong fine-tuning performance, and different bit-width and rank configurations can yield significantly different outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during mixed-precision quantized fine-tuning. To tackle the large discrete search space and the high evaluation cost of frequent fine-tuning iterations, AutoQRA decomposes the optimization into two stages. First, it conducts a global multi-fidelity evolutionary search, warm-starting the initial population with layer-wise importance priors; this stage employs dedicated operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization locally refines promising regions of the search space to identify optimal configurations under the given memory budget. This approach enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.
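To make the first stage concrete, the search described above can be sketched as an evolutionary loop over per-layer (bit-width, rank) pairs under a memory budget, with evaluation fidelity increasing over generations. This is a minimal illustration, not the paper's method: the toy memory cost, the noisy fitness proxy (standing in for "briefly fine-tune and evaluate"), and all names and constants here are assumptions; the paper's specific operators, performance model, and importance-prior warm-starting are omitted.

```python
import random

random.seed(0)  # deterministic for the example

# Assumed discrete choices and sizes, purely illustrative.
BITS = [2, 3, 4, 8]
RANKS = [4, 8, 16, 32]
N_LAYERS = 8
BUDGET = 40.0  # abstract memory units


def memory_cost(cfg):
    # Toy cost model: bit-width dominates, rank adds a small adapter overhead.
    return sum(b + 0.1 * r for b, r in cfg)


def fitness(cfg, fidelity):
    # Stand-in for "fine-tune briefly and evaluate": a toy score that is
    # noisier at low fidelity; infeasible configs are rejected outright.
    if memory_cost(cfg) > BUDGET:
        return float("-inf")
    score = sum(b * 0.5 + r * 0.05 for b, r in cfg)
    return score + random.gauss(0, 1.0 / fidelity)


def mutate(cfg):
    # Resample the (bit, rank) pair of one random layer.
    new = list(cfg)
    i = random.randrange(N_LAYERS)
    new[i] = (random.choice(BITS), random.choice(RANKS))
    return tuple(new)


def evolve(pop_size=16, generations=20):
    pop = [tuple((random.choice(BITS), random.choice(RANKS))
                 for _ in range(N_LAYERS)) for _ in range(pop_size)]
    for g in range(generations):
        fidelity = 1 + g  # spend more evaluation effort in later generations
        scored = sorted(pop, key=lambda c: fitness(c, fidelity), reverse=True)
        survivors = scored[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    # Pick the best feasible config at high fidelity.
    feasible = [c for c in pop if memory_cost(c) <= BUDGET]
    return max(feasible or pop, key=lambda c: fitness(c, 100))


best = evolve()
```

In AutoQRA, the configurations surviving this screening stage would then be handed to the local refinement stage rather than returned directly.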
Problem

Research questions and friction points this paper is trying to address.

mixed-precision quantization
low-rank adapters
LLM fine-tuning
memory-constrained adaptation
quantization-aware optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision quantization
low-rank adapters
joint optimization
evolutionary search
Bayesian optimization
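The second stage, trust-region refinement, can be sketched as sampling candidates within a shrinking neighborhood of the incumbent configuration. Note the hedge: a random sampler stands in for the paper's Bayesian surrogate model, and the toy objective, names, and trust-region update rule below are assumptions for illustration only.

```python
import random

random.seed(0)  # deterministic for the example

# Assumed discrete choices, purely illustrative.
BITS = [2, 3, 4, 8]
RANKS = [4, 8, 16, 32]


def objective(cfg):
    # Toy stand-in for short fine-tuning plus evaluation.
    return sum(b * 0.5 + r * 0.05 for b, r in cfg)


def sample_in_region(cfg, radius):
    # Perturb up to `radius` layers: the trust region is the set of
    # configurations differing from the incumbent in at most `radius` layers.
    cand = list(cfg)
    for i in random.sample(range(len(cfg)), k=radius):
        cand[i] = (random.choice(BITS), random.choice(RANKS))
    return tuple(cand)


def trust_region_refine(start, iters=50):
    best, best_val = start, objective(start)
    radius = max(1, len(start) // 2)
    for _ in range(iters):
        cand = sample_in_region(best, radius)
        val = objective(cand)
        if val > best_val:
            best, best_val = cand, val
            radius = min(len(best), radius + 1)  # success: expand the region
        else:
            radius = max(1, radius - 1)  # failure: shrink the region
    return best


start = tuple((4, 8) for _ in range(8))  # e.g. a uniform 4-bit, rank-8 config
refined = trust_region_refine(start)
```

The shrink-on-failure, expand-on-success update mirrors the standard trust-region heuristic; the paper additionally enforces the memory budget and uses a Bayesian surrogate to choose candidates, both simplified away here.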
๐Ÿ”Ž Similar Papers
No similar papers found.
Changhai Zhou, Fudan University
Shiyang Zhang, Yale University
Yuhua Zhou, Zhejiang University
Qian Qiao, OpenWPLab
Jun Gao, Yale University
Cheng Jin, Fudan University (Image and Video Processing · Computer Vision · HCI)
Kaizhou Qin, Fudan University
Weizhong Zhang, Fudan University (Machine Learning · Deep Learning · Optimization)