L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

📅 2024-02-07
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between accuracy and efficiency in quantization-aware fine-tuning of large language models (LLMs), this paper proposes L4Q, a tightly integrated framework that unifies quantization-aware training (QAT) with Low-Rank Adaptation (LoRA). It introduces memory-optimized layers supporting 3- and 4-bit integer quantization and employs layer-wise parameter freezing coupled with gradient reparameterization, preserving QAT's advantage of producing fully quantized models with high accuracy while reducing training memory consumption to near-LoRA levels. Experiments on LLaMA and Mistral demonstrate that L4Q significantly outperforms decoupled approaches such as PTQ+LoRA: under 4-bit quantization, it achieves an average +2.1% improvement in task accuracy and attains few-shot performance nearly matching that of full-precision baselines. The framework thus jointly delivers low-bit quantization, high accuracy, and low training overhead.

📝 Abstract
Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover accuracy loss. However, this approach has limitations in recovering the accuracy loss. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA, while preserving the advantage of QAT in producing fully quantized LLMs with high accuracy. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and Mistral models with instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.
Problem

Research questions and friction points this paper is trying to address.

High memory and computational costs of large language models
Decoupled PTQ-then-PEFT pipelines have limited ability to recover quantization accuracy loss
QAT's memory overhead makes quantization-aware fine-tuning expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Quantization-Aware Training with LoRA
Uses memory-optimized layer design for QAT
Achieves high accuracy in 4-bit and 3-bit quantization
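The core idea behind integrating QAT with LoRA can be sketched as follows: instead of quantizing the pretrained weight first and keeping the LoRA adapter in full precision (as in PTQ+LoRA), the base weight and the low-rank update B·A are merged and quantized jointly, so the resulting model is fully quantized. This is a minimal illustrative sketch only; all function names, shapes, and the scalar quantizer are assumptions, and the paper's actual memory-optimized layer design and gradient reparameterization are not reproduced here.

```python
# Hypothetical sketch of joint "quantize the merged weight" (L4Q-style idea),
# contrasted with PTQ+LoRA, which quantizes W alone. Pure Python, toy sizes.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def fake_quantize(W, bits=4):
    """Symmetric uniform quantize-dequantize (a QAT-style 'fake quant')."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    max_abs = max(abs(v) for row in W for v in row) or 1.0
    scale = max_abs / qmax
    # Snap each value to the integer grid, then map back to floats.
    return [[round(v / scale) * scale for v in row] for row in W]

def joint_quantized_weight(W, A, B, alpha=1.0, bits=4):
    """Quantize W + alpha * (B @ A) jointly, so the final weight is
    fully quantized rather than 'quantized W + full-precision adapter'."""
    BA = matmul(B, A)
    merged = [[W[i][j] + alpha * BA[i][j] for j in range(len(W[0]))]
              for i in range(len(W))]
    return fake_quantize(merged, bits)

# Tiny example: 2x2 weight with rank-1 LoRA factors (illustrative values).
W = [[0.52, -0.31], [0.10, 0.77]]
B = [[0.2], [-0.1]]        # shape (out_features, r)
A = [[0.5, 0.4]]           # shape (r, in_features)
Wq = joint_quantized_weight(W, A, B, bits=4)
```

In a real QAT loop, the rounding step would use a straight-through estimator so gradients flow back to the LoRA factors; the sketch above only shows the forward quantization of the merged weight.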