LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning quantized LLMs for edge devices faces three key challenges: (1) a data-type mismatch between low-bit quantized weights and high-precision adaptation weights; (2) accuracy degradation when merging adaptation weights into the quantized weights; and (3) the lack of any method that losslessly merges adaptation weights while adjusting all quantized weights. This paper proposes LoTA-QAF (Lossless Ternary Adaptation for Quantization-Aware Fine-tuning), which combines three components: a custom ternary adaptation (TA) whose weights align exactly with the quantization grid, a TA-based mechanism for lossless merging of the adaptation weights, and ternary signed gradient descent (t-SignSGD) for updating the TA weights. Applied to Llama-3.1/3.3 and Qwen-2.5, LoTA-QAF effectively recovers the performance of quantized models on MMLU, surpassing 16-bit LoRA by up to 5.14%. On task-specific fine-tuning, 16-bit LoRA remains stronger, but LoTA-QAF outperforms other quantized fine-tuning methods.

📝 Abstract
Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.
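The lossless merge described in the abstract follows from keeping both operands on the same integer grid: a quantized weight is an index into the quantization grid, and a ternary adaptation weight in {-1, 0, +1} shifts that index by at most one step, so merging is exact integer addition with clamping. A minimal sketch of this idea (the function name and clamping details are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def merge_ternary(q_idx, t_adapt, n_bits=4):
    """Merge ternary adaptation weights into quantized weight indices.

    q_idx   : integer indices of the quantized weights on the grid
    t_adapt : ternary adaptation weights in {-1, 0, +1}

    Both operands live on the same integer grid, so the merge is exact
    integer addition (clamped to the grid range): no rounding or
    truncation, hence no accuracy loss at merge time.
    """
    lo, hi = 0, 2**n_bits - 1              # e.g. indices 0..15 for 4-bit
    return np.clip(q_idx + t_adapt, lo, hi)

q = np.array([0, 7, 15, 3], dtype=np.int8)   # 4-bit weight indices
t = np.array([1, -1, 1, 0], dtype=np.int8)   # ternary adaptation
merged = merge_ternary(q, t)                 # -> [1, 6, 15, 3]
```

Contrast this with merging a 16-bit LoRA update into 4-bit weights, where the high-precision delta must be re-quantized onto the grid and rounding error is unavoidable.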
Problem

Research questions and friction points this paper is trying to address.

Mismatch between low-precision quantized and high-precision adaptation weights
Accuracy loss when merging high-precision weights into quantized weights
No existing methods support lossless merging of adaptation weights while adjusting all quantized weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lossless ternary adaptation for quantized models
Ternary weights aligned with quantization grid
Ternary signed gradient descent for updates
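The third bullet, ternary signed gradient descent, can be sketched as mapping each gradient entry to a ternary step in {-1, 0, +1}: small-magnitude gradients produce no adjustment, and larger ones move the quantized weight one step along the grid against the gradient sign. The fixed threshold `tau` below is an illustrative assumption; the paper's actual rule for selecting which entries update is not given in this summary.

```python
import numpy as np

def t_signsgd_step(grad, tau):
    """One ternary signed-gradient step (hypothetical sketch).

    Gradients with |g| < tau map to 0 (no adjustment); the rest map to
    -sign(g), i.e. a single one-step move on the quantization grid.
    """
    step = -np.sign(grad).astype(np.int8)  # descend: move against the gradient
    step[np.abs(grad) < tau] = 0           # suppress small-magnitude updates
    return step

g = np.array([0.8, -0.05, -1.2, 0.01])
print(t_signsgd_step(g, tau=0.1))          # -> [-1  0  1  0]
```

Discarding gradient magnitude keeps every update on the integer grid, which is what makes the subsequent merge lossless.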
Junyu Chen
Southwestern University of Finance and Economics, Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province
Junzhuo Li
The Hong Kong University of Science and Technology (Guangzhou)
Zhen Peng
Sun Yat-sen University
Wenjie Wang
Southwestern University of Finance and Economics, Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province
Yuxiang Ren
Tenure-track Assistant Professor, Nanjing University
Graph Neural Network · AI for Science · Foundation Model
Long Shi
Southwestern University of Finance and Economics, Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing · Large Language Model