RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of large language models (LLMs) under low-bit (≤4-bit) quantization after supervised fine-tuning (SFT), this paper proposes a quantization-aware supervised fine-tuning (QA-SFT) paradigm. The core contribution is the Rotated Straight-Through Estimator (RoSTE), which tightly integrates an adaptive rotational transformation with quantization-aware training to suppress activation outliers. A theoretical analysis of an over-parameterized least-squares quantized training problem shows that the prediction error is proportional to the quantization error of the converged weights, which an optimized rotation configuration can effectively manage. The approach spans rotation-matrix transformation, low-bit (≤4-bit) quantization of weights, activations, and KV caches, and quantization-aware backpropagation. Evaluated on Pythia and Llama architectures of different sizes, the method consistently outperforms post-SFT quantization baselines across diverse tasks while jointly quantizing weights, activations, and KV caches to low bit-widths with little accuracy loss.
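The quantization-aware backpropagation mentioned above typically relies on a straight-through estimator (STE): the forward pass uses quantized weights, while the backward pass treats the non-differentiable rounding as identity. A minimal NumPy sketch of this building block (an illustration of the general STE technique, not the paper's implementation; `fake_quantize` and `ste_grad` are hypothetical names):

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Uniform symmetric fake-quantization to num_bits (per-tensor scale)."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax + 1e-12
    # Quantize-then-dequantize: the forward pass sees quantized values.
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def ste_grad(upstream_grad):
    """Straight-through estimator: rounding is treated as identity in
    the backward pass, so gradients flow through unchanged."""
    return upstream_grad
```

During QA-SFT, each linear layer would apply `fake_quantize` to its weights (and activations) in the forward pass while updating the full-precision weights with `ste_grad`-passed gradients.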

📝 Abstract
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures.
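The abstract's key mechanism is that an orthogonal rotation reduces activation outliers before quantization. The intuition can be sketched in NumPy with a random orthogonal matrix as a stand-in for the paper's adaptive rotation (an illustrative toy, not the method itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation matrix with one outlier channel, mimicking the
# channel-wise outliers observed in LLM activations.
x = rng.normal(size=(64, 8))
x[:, 0] *= 50.0                          # channel 0 carries large outliers

# Random orthogonal rotation via QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
x_rot = x @ q

# The rotation preserves the Frobenius norm but mixes the outlier channel
# into all channels, typically shrinking the per-tensor max and hence the
# quantization step size needed to cover the range.
print("max before:", np.abs(x).max(), " max after:", np.abs(x_rot).max())
assert np.allclose(np.linalg.norm(x), np.linalg.norm(x_rot))
```

Because the rotation is orthogonal, it can be folded into adjacent weight matrices without changing the network's function, which is what makes rotation-based outlier suppression attractive for weight, activation, and KV-cache quantization.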
Problem

Research questions and friction points this paper is trying to address.

Conventional pipelines fine-tune first and quantize afterwards, which is suboptimal because it ignores the synergy between fine-tuning and quantization
Activation outliers make low-bit (≤4-bit) quantization of weights, activations, and KV caches difficult
Existing post-SFT quantization baselines degrade accuracy across tasks and LLM architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy
Identifies rotation configurations that reduce activation outliers
Theoretical analysis shows the prediction error is proportional to the quantization error of the converged weights, which an optimized rotation can manage