AI Summary
This work addresses the suboptimal performance of conventional serial pipelines that apply quantization and low-rank adaptation (LoRA) fine-tuning separately, neglecting the coupling between bit-width and LoRA rank and yielding unstable performance under fixed memory budgets. To overcome this limitation, we propose AutoQRA, the first framework to jointly optimize per-layer bit-width and LoRA rank configurations within mixed-precision quantized fine-tuning. AutoQRA combines multi-fidelity evolutionary search with trust-region Bayesian optimization and incorporates layer-importance priors to reduce evaluation cost. Under strict memory constraints, AutoQRA significantly surpasses existing serial methods while keeping memory usage comparable to uniform 4-bit quantization, closely approaching full-precision fine-tuning results.
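The bit-width/rank coupling under a fixed budget can be made concrete with a toy per-layer memory model. The cost formula below (quantized weights at `bits` per parameter, LoRA factors stored in 16-bit precision) and the layer dimensions are illustrative assumptions, not the paper's actual accounting:

```python
def layer_bytes(d_in: int, d_out: int, bits: int, rank: int) -> float:
    """Toy memory cost of one layer: quantized weights + fp16 LoRA factors."""
    weights = d_in * d_out * bits / 8          # quantized weight matrix
    lora = 2 * rank * (d_in + d_out)           # A: d_in x r, B: r x d_out, fp16
    return weights + lora

# Reference budget: a 4096x4096 layer at 4 bits with LoRA rank 16.
budget = layer_bytes(4096, 4096, bits=4, rank=16)

# Spending the same budget at 3 bits frees memory for a far larger rank:
spare = budget - 4096 * 4096 * 3 / 8
max_rank = int(spare // (2 * (4096 + 4096)))
print(max_rank)  # 144: one fewer bit buys roughly 9x the adapter capacity
```

Under this toy model, dropping a single bit of weight precision funds a rank-144 adapter in place of a rank-16 one, which is exactly the kind of trade-off a serial quantize-then-tune pipeline cannot explore.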
Abstract
Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to exploit the intricate interaction between quantization bit-width and LoRA rank: a quantization allocation carefully optimized for low quantization error does not always translate into strong fine-tuning performance, and different bit-width and rank configurations can yield markedly different outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously selects the bit-width and LoRA rank for each layer during mixed-precision quantized fine-tuning. To cope with the large discrete search space and the high evaluation cost of repeated fine-tuning, AutoQRA decomposes the optimization into two stages. First, it performs a global multi-fidelity evolutionary search whose initial population is warm-started with layer-wise importance priors; this stage uses specialized search operators and a performance model to screen candidate configurations efficiently. Second, trust-region Bayesian optimization locally refines promising regions of the search space to identify optimal configurations under the given memory budget. This design enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.
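The first-stage global search described above can be sketched as a budget-constrained evolutionary loop over per-layer (bit-width, rank) pairs. Everything here is a minimal illustration under stated assumptions: the candidate sets, the additive memory model, and the importance-weighted proxy score stand in for the paper's multi-fidelity performance model and are not its actual implementation (the importance-prior warm start and the second-stage Bayesian refinement are omitted for brevity):

```python
import random

random.seed(0)

BITS = [2, 3, 4, 8]     # candidate per-layer bit-widths (assumed)
RANKS = [4, 8, 16, 32]  # candidate per-layer LoRA ranks (assumed)
N_LAYERS = 6
BUDGET = 30.0           # abstract memory units (assumed)

def memory_cost(cfg):
    # Toy additive memory model: bits plus a small LoRA term per layer.
    return sum(b + 0.1 * r for b, r in cfg)

def proxy_score(cfg, importance):
    # Toy performance proxy: extra bits/rank help more on important layers.
    return sum(w * (b / 8 + r / 32) for w, (b, r) in zip(importance, cfg))

def random_config():
    return [(random.choice(BITS), random.choice(RANKS)) for _ in range(N_LAYERS)]

def mutate(cfg):
    # Resample the (bit-width, rank) pair of one randomly chosen layer.
    child = list(cfg)
    child[random.randrange(N_LAYERS)] = (random.choice(BITS), random.choice(RANKS))
    return child

def evolutionary_search(importance, pop_size=32, generations=20):
    # Seed the population with budget-feasible random configurations.
    pop = [c for c in (random_config() for _ in range(200))
           if memory_cost(c) <= BUDGET][:pop_size]
    for _ in range(generations):
        pop.sort(key=lambda c: proxy_score(c, importance), reverse=True)
        elite = pop[: pop_size // 2]
        children = [mutate(random.choice(elite)) for _ in range(pop_size)]
        # Keep elites; admit only children that respect the memory budget.
        pop = elite + [c for c in children if memory_cost(c) <= BUDGET]
    return max(pop, key=lambda c: proxy_score(c, importance))

best = evolutionary_search(importance=[0.5, 1.0, 2.0, 2.0, 1.0, 0.5])
assert memory_cost(best) <= BUDGET
```

With middle layers marked as more important, the search tends to spend its budget there, giving them higher bits or ranks while squeezing the outer layers, which mirrors the non-uniform allocations the joint search is meant to discover.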