Improving Quantization with Post-Training Model Expansion

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the accuracy degradation of large language models (LLMs) under low-bit quantization, this paper proposes a post-training model expansion method that improves the quality of 4-bit quantized LLMs without end-to-end retraining. The core contribution is a systematic demonstration that post-training expansion and quantization can be co-designed: a selective, progressive parameter-expansion mechanism is combined with Hadamard rotations, high-precision retention of sensitive weights, layer-wise expansion, and quantization-aware structural adaptation. On Llama3 1B with full 4-bit weight and activation quantization, the approach narrows the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant, at the cost of only 5% more parameters. The expanded model is still 3.8% smaller in volume than the BF16 baseline, breaking the conventional "compression implies parameter reduction" assumption and enabling joint optimization of accuracy and efficiency.

📝 Abstract
The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.
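The incoherence processing mentioned in the abstract can be illustrated concretely. The sketch below is an illustrative toy, not the paper's implementation: the `hadamard` and `quantize_4bit` helpers, the matrix size, and the injected outlier column are all assumptions. It shows why rotating a weight matrix with an orthonormal Hadamard matrix before 4-bit quantization helps: the rotation spreads outlier energy across all columns, shrinking the quantization scale and hence the rounding error.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_4bit(x):
    # Symmetric per-tensor 4-bit quantization: 16 levels in [-8, 7], then dequantize.
    scale = np.abs(x).max() / 7
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[:, 0] *= 50  # an outlier column dominates the per-tensor scale

H = hadamard(64)
# H is orthonormal, so the rotation is exactly invertible: (W @ H) @ H.T == W.
direct_err = np.linalg.norm(quantize_4bit(W) - W)
rotated_err = np.linalg.norm(quantize_4bit(W @ H) @ H.T - W)
print(direct_err, rotated_err)  # rotated quantization error is much smaller
```

Because the rotation is exactly invertible, it can be fused into adjacent weights or applied as an online Hadamard transform in the compute graph, which is the "model expansion" cost the abstract refers to: a little extra computation bought in exchange for lower quantization error.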
Problem

Research questions and friction points this paper is trying to address.

Improves model quality via post-training expansion during quantization
Reduces accuracy gap in 4-bit LLMs without full retraining
Balances quantization constraints with selective parameter increase
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training model expansion improves quantization quality
Selective expansion without end-to-end retraining
Hadamard rotations and higher precision computations
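The "higher precision computations" idea above can likewise be sketched as a mixed-precision decomposition: quantize most of the weight matrix to 4 bits while keeping a small fraction of the largest-magnitude weights in full precision. This is an illustrative toy under assumed heuristics (magnitude-based selection, per-tensor symmetric scaling), not the paper's exact procedure; `quantize_with_retention` and `keep_frac` are made-up names.

```python
import numpy as np

def quantize_4bit(x):
    # Symmetric per-tensor 4-bit quantization (levels -8..7), then dequantize.
    scale = np.abs(x).max() / 7
    return np.clip(np.round(x / scale), -8, 7) * scale

def quantize_with_retention(W, keep_frac=0.005):
    # Keep the top-|w| fraction of weights in full precision as a sparse
    # residual; quantize the remaining (outlier-free) weights to 4 bits.
    k = max(1, int(W.size * keep_frac))
    idx = np.unravel_index(np.argsort(np.abs(W), axis=None)[-k:], W.shape)
    mask = np.zeros(W.shape, dtype=bool)
    mask[idx] = True
    W_low = np.where(mask, 0.0, W)          # outliers removed before scaling
    return quantize_4bit(W_low) + np.where(mask, W, 0.0)

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))
W.flat[rng.choice(W.size, 20, replace=False)] += 30.0  # sensitive outlier weights

err_plain = np.linalg.norm(quantize_4bit(W) - W)
err_mixed = np.linalg.norm(quantize_with_retention(W) - W)
print(err_plain, err_mixed)  # retention shrinks the error substantially
```

Storing the retained weights is a genuine parameter increase, which is exactly the post-training expansion trade-off the paper studies: a ~0.5% sparse high-precision residual in this toy buys a far better reconstruction than pure 4-bit quantization.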