🤖 AI Summary
To address the prohibitively high memory footprint of full-parameter fine-tuning (FT), which hinders single-GPU training, this work proposes the first end-to-end INT8 integer-only full-parameter fine-tuning framework. Methodologically: (1) we theoretically establish the intrinsic robustness of the Lion optimizer to INT8 quantization; (2) we design a hybrid feature quantizer that preserves sparse, critical features while quantizing the remaining dense ones; and (3) we introduce an O(1)-complexity stack-based integer gradient flow, enabling fully integer arithmetic across weights, gradients, and optimizer states. Experiments on LLaMA-7B show that our method reduces peak memory consumption to under 30 GB, which enables training on a single A6000 GPU, while compressing model-state memory to just 21% of the FP32 baseline, with accuracy comparable to FP32 FT. This work thus delivers high-fidelity, resource-efficient full-parameter fine-tuning and makes large-model training more accessible.
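The idea behind the hybrid feature quantizer in point (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the top-k magnitude rule for picking critical entries, the symmetric per-tensor scale, and the names `hybrid_quantize`, `hybrid_dequantize`, and `plain_int8_roundtrip` are all assumptions made for this example.

```python
def hybrid_quantize(values, num_outliers=1):
    """Protect a few large-magnitude entries and quantize the rest to INT8 codes."""
    # Assumed selection rule: treat the largest-magnitude entries as critical.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in order[:num_outliers]}

    # The scale is computed from the dense (non-outlier) entries only.
    dense = [v for i, v in enumerate(values) if i not in outliers]
    max_abs = max((abs(v) for v in dense), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0

    codes = [
        0 if i in outliers else max(-127, min(127, round(v / scale)))
        for i, v in enumerate(values)
    ]
    return codes, scale, outliers


def hybrid_dequantize(codes, scale, outliers):
    """Rebuild values: protected entries come back exactly, the rest as code * scale."""
    return [outliers[i] if i in outliers else c * scale for i, c in enumerate(codes)]


def plain_int8_roundtrip(values):
    """Baseline: one symmetric INT8 scale for every entry, outliers included."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


# Toy tensor: mostly small values plus one large entry.
weights = [0.012, -0.034, 0.021, -0.007, 8.5, 0.015]

codes, scale, outliers = hybrid_quantize(weights, num_outliers=1)
restored = hybrid_dequantize(codes, scale, outliers)

hybrid_err = max(abs(a - b) for a, b in zip(weights, restored))
plain_err = max(abs(a - b) for a, b in zip(weights, plain_int8_roundtrip(weights)))
print(hybrid_err < plain_err)
```

In this toy tensor the single large entry would otherwise stretch the quantization scale and round most of the small values to zero. Setting it aside lets the dense values use the full INT8 range.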
📝 Abstract
Large Language Models (LLMs) have shown remarkable impact across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key contributions: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; ii) for quantized weights, we employ a hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance; for example, tuning a LLaMA-7B model requires less than 30 GB of memory, making it feasible on a single A6000 GPU.
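The "consistent update magnitudes" property of Lion mentioned in point i) can be seen in a small sketch. The update rule below follows the published Lion optimizer (sign of an interpolation between momentum and gradient). Everything else is an assumption for illustration: `fake_int8` simulates INT8 storage with a quantize-dequantize round trip in floats, the tensors are toy values, and the code is not the QFT pipeline or a substitute for the paper's proof.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fake_int8(values):
    """Simulate INT8 storage: symmetric quantize, then dequantize back to floats."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step on plain Python lists."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Only the sign of the interpolated value reaches the parameters,
        # so every nonzero update has the same magnitude lr.
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (direction + weight_decay * p))
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Toy values, chosen only for illustration.
params = [0.5, -0.3, 0.8, 0.1]
grads = [0.20, -0.05, 0.90, -0.40]
momentum = [0.10, -0.02, 0.30, -0.25]

exact_params, _ = lion_step(params, grads, momentum)
quant_params, _ = lion_step(params, fake_int8(grads), fake_int8(momentum))

# True when quantization noise did not flip any sign in this toy case.
print(exact_params == quant_params)
```

Quantization error perturbs the gradient and momentum values, but the parameter step changes only if that error flips the sign of the interpolated term. An optimizer whose step scales with the raw gradient and state values would pass the rounding error directly into the weights.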