Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

📅 2024-07-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Quantized LLM fine-tuning suffers from “balance mismatch”: LoRA adapters exhibit high input/output complexity yet low effective trainability, leading to underfitting; subsequent low-precision conversion further degrades performance. Method: We propose Q-BaRA and QA-HiRA—Q-BaRA enhances adapter effectiveness via structural simplification and joint rank optimization; QA-HiRA introduces a single-matrix high-rank adapter coupled with block-wise quantization alignment, enabling lossless integration of fine-tuned parameters into 4/8-bit inference models. Contribution/Results: This work is the first to identify and characterize this imbalance, establishing a new quantization-aware end-to-end fine-tuning paradigm. Evaluated on LLaMA and LLaMA2, our methods significantly outperform state-of-the-art approaches in accuracy, while preserving parameter count and computational cost during training and incurring zero additional deployment overhead.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), greatly reducing memory usage but resulting in noticeable performance degradation. In this paper, we identify an imbalance in fine-tuning quantized pre-trained models: overly complex adapter inputs and outputs versus low effective trainability of the adaptation. We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA), which simplifies the adapter inputs and outputs while increasing the adapter's rank to achieve a more suitable balance for fine-tuning quantized LLMs. Additionally, for scenarios where fine-tuned LLMs need to be deployed as low-precision inference models, we introduce Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA), which simplifies the adapter inputs and outputs to align with the pre-trained model's block-wise quantization while employing a single matrix to achieve a higher rank. Both Q-BaRA and QA-HiRA are easily implemented and offer the following optimizations: (i) Q-BaRA consistently achieves the highest accuracy compared to baselines and other variants, requiring the same number of trainable parameters and computational effort; (ii) QA-HiRA naturally merges adapter parameters into the block-wise quantized model after fine-tuning, achieving the highest accuracy compared to other methods. We apply our Q-BaRA and QA-HiRA to the LLaMA and LLaMA2 model families and validate their effectiveness across different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/xiaocaigou/qbaraqahira
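The parameter-budget arithmetic behind the "balance" idea can be sketched numerically: a standard LoRA adapter of rank r on a d_out × d_in layer trains r·(d_in + d_out) parameters, so shrinking the adapter's input and output widths by a factor f frees enough budget to raise the rank to f·r. A minimal numpy illustration follows; the average-pooling and repeat operators used to shrink and re-expand the activations are assumptions for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

d_in, d_out, r, f = 4096, 4096, 16, 4  # illustrative layer sizes

# Standard LoRA: update = B @ A, with A (r x d_in) and B (d_out x r).
lora_params = r * (d_in + d_out)

# Balanced adapter (sketch): compress inputs/outputs by factor f,
# which allows rank f*r on the smaller dimensions at the same budget.
r_bal = f * r
bara_params = r_bal * (d_in // f + d_out // f)
assert lora_params == bara_params  # same trainable parameters, 4x the rank

# Forward-pass sketch: pool the input, apply the low-rank update,
# then expand back to the full output width.
x = np.random.randn(d_in)
A = 0.01 * np.random.randn(r_bal, d_in // f)
B = np.zeros((d_out // f, r_bal))              # B starts at zero, as in LoRA
x_pooled = x.reshape(d_in // f, f).mean(axis=1)  # average-pool groups of f
delta_small = B @ (A @ x_pooled)
delta = np.repeat(delta_small, f)                # replicate back to d_out
```

With B initialized to zero the adapter contributes nothing at step 0, matching the standard LoRA initialization, while the effective rank available during training is f times higher than a parameter-matched LoRA.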
Problem

Research questions and friction points this paper is trying to address.

Address performance degradation in quantized LLM fine-tuning
Balance adapter complexity and trainability to reduce underfitting
Enable efficient low-precision deployment with quantization-aware fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced Low-Rank Adaptation for quantized LLMs
Simplifies adapter inputs and outputs
Increases adapter rank to reduce underfitting
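The deployment-time property QA-HiRA targets can be sketched as well: because the adapter is aligned with the pre-trained model's block-wise quantization, the fine-tuned update folds directly into the per-block quantized weights, leaving no extra module at inference. The simple symmetric absmax block quantizer below is an assumption for illustration, not the paper's exact scheme.

```python
import numpy as np

def quantize_blockwise(W, block=64, bits=4):
    """Symmetric per-block quantization: each run of `block` consecutive
    weights shares one scale derived from the block's absolute maximum."""
    qmax = 2 ** (bits - 1) - 1
    flat = W.reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.round(flat / scales).clip(-qmax, qmax)  # integer codes
    return (q * scales).reshape(W.shape)           # dequantized view

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)       # pre-trained weights
delta = 0.01 * rng.standard_normal((256, 256)).astype(np.float32)  # fine-tuned update

# Quantization-aware merge: absorb the update, then re-quantize block-wise.
# The deployed model is a single low-bit weight tensor with no adapter.
W_deployed = quantize_blockwise(W + delta)
```

The key point the sketch illustrates is that when fine-tuning is already aware of the block structure, this merge-and-requantize step introduces no structural mismatch, so no separate adapter needs to ship with the inference model.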
Ao Shen
Purdue University
machine learning system and architecture
Qiang Wang
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Hunan Changsha 410073, China
Zhiquan Lai
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Hunan Changsha 410073, China
Xiong-lve Li
College of Computer, National University of Defense Technology, Hunan Changsha 410073, China
Dongsheng Li
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Hunan Changsha 410073, China