Revisiting Adaptive Rounding with Vectorized Reparameterization for LLM Quantization

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and memory overhead of traditional adaptive rounding methods, which rely on dense, element-wise rounding matrices and struggle to scale to billion-parameter language models. The authors propose VQRound, a framework that reparameterizes the rounding operation with a compact vector codebook, minimizing the worst-case element-wise error under the L∞ norm. VQRound introduces codebook sharing across layers and a lightweight end-to-end fine-tuning strategy that uses only 128 calibration samples. With merely 0.2% of the trainable parameters, the method converges faster and reaches higher quantization accuracy than existing adaptive rounding approaches across multiple large language models, including OPT, LLaMA, LLaMA2, and Qwen3, while also revealing the critical role of rounding initialization in quantization performance.

📝 Abstract
Adaptive Rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet, dense and element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, VQRound minimizes the element-wise worst-case error under $L_\infty$ norm, which is critical for handling heavy-tailed weight distributions in LLMs. Beyond reparameterization, we identify rounding initialization as a decisive factor and develop a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers using only 128 samples. Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models demonstrate that VQRound achieves better convergence than traditional adaptive rounding at the same number of steps while using as little as 0.2% of the trainable parameters. Our results show that adaptive rounding can be made both scalable and fast-fitting. The code is available at https://github.com/zhoustan/VQRound.
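To make the abstract's core idea concrete, here is a minimal sketch of codebook-reparameterized rounding. It is based only on the description above: instead of storing one rounding variable per weight (as element-wise adaptive rounding does), each group of weights shares a rounding vector drawn from a small codebook, and groups are assigned to codebook entries by minimizing the worst-case (L∞) element-wise error. All names (`rtn_offsets`, `fit_codebook_linf`, `group_dim`, `codebook_size`) are illustrative, not the authors' API, and the codebook initialization and final binarization are simplifying assumptions, not the paper's actual pipeline.

```python
import numpy as np

def rtn_offsets(w, scale):
    """Fractional residuals that a dense element-wise rounding matrix
    would have to store, one entry per weight."""
    t = w / scale
    return t - np.floor(t)  # in [0, 1)

def fit_codebook_linf(offsets, codebook_size=16, group_dim=8):
    """Assign each group of `group_dim` offsets to the codebook vector
    with the smallest worst-case (L-infinity) element-wise error.
    Codebook init by random row sampling is a toy choice here."""
    groups = offsets.reshape(-1, group_dim)
    rng = np.random.default_rng(0)
    codebook = groups[rng.choice(len(groups), codebook_size, replace=False)]
    # max absolute difference per (group, code) pair, then argmin
    dists = np.abs(groups[:, None, :] - codebook[None, :, :]).max(axis=2)
    codes = dists.argmin(axis=1)
    return codebook, codes

def quantize(w, scale, codebook, codes):
    """Quantize weights with codebook-reparameterized rounding:
    binarize the shared soft offsets into per-weight round-up/down
    decisions (a simplifying stand-in for learned rounding)."""
    soft = codebook[codes].reshape(w.shape)
    r = (soft >= 0.5).astype(w.dtype)
    q = np.clip(np.floor(w / scale) + r, -8, 7)  # 4-bit signed grid
    return q * scale

w = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
scale = np.abs(w).max() / 7  # symmetric 4-bit scale
cb, codes = fit_codebook_linf(rtn_offsets(w, scale))
wq = quantize(w, scale, cb, codes)
print(cb.size, w.size)  # stored rounding parameters vs. element-wise
```

The parameter saving comes from the last line: the codebook holds `codebook_size * group_dim` values regardless of layer width, whereas an element-wise rounding matrix grows with the weight count, which is what makes the dense formulation prohibitive at billion-parameter scale.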

Problem

Research questions and friction points this paper is trying to address.

Adaptive Rounding
LLM Quantization
Post-Training Quantization
Parameter Efficiency
Rounding Matrix
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Rounding
Vectorized Reparameterization
Codebook Quantization
Post-Training Quantization
Large Language Models