CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address accuracy degradation in post-training quantization of large language models caused by weight outliers, this paper proposes a calibration-free low-bit quantization method. The core innovation is a differentiable surrogate for the quantization loss, integrated with structured learnable matrix transformations (single or dual matrices) and adaptive rounding, optimized for 4-bit and 3-bit weight representations without any input samples. This is the first method achieving fully calibration-free efficient quantization, eliminating data-privacy concerns and data-acquisition constraints. Evaluated on Gemma-2 9B, it raises the average benchmark score from 61.9 to 62.4 at 4-bit and achieves a substantial 8.6-point gain at 3-bit (52.0 → 60.6), with less than 3% additional computational overhead, matching the performance of calibration-dependent methods such as GPTQ.

📝 Abstract
Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is a round-to-nearest quantization scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, applying random rotation transformations, or optimizing toward a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios, as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss that requires no calibration data. To maintain inference efficiency, we apply structured matrix transformations to single matrices; for paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding. We conduct experiments on Gemma 2 models and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% computational overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.
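The outlier problem the abstract describes can be seen in a few lines. Below is a minimal sketch (not the paper's method) of symmetric round-to-nearest quantization: a single large weight inflates the quantization scale, which degrades the precision available to all the ordinary weights.

```python
import numpy as np

def round_to_nearest_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest quantization (a common baseline)."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax      # outliers inflate this scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=1024)

w[0] = 50.0                               # inject a single outlier
err_outlier = np.mean((w - round_to_nearest_quantize(w, bits=4)) ** 2)

w[0] = 0.0                                # remove the outlier
err_clean = np.mean((w - round_to_nearest_quantize(w, bits=4)) ** 2)

print(err_outlier > err_clean)            # the outlier dominates the error
```

The paper's learned transformations and adaptive rounding target exactly this failure mode, but without using any calibration inputs to measure the error.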
Problem

Research questions and friction points this paper is trying to address.

Develops calibration-free quantization for large language models
Optimizes transformations and adaptive rounding without calibration data
Reduces quantization errors while maintaining inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned transformations optimize quantization without calibration data
Adaptive rounding method reduces quantization errors in LLMs
Structured dual matrix transformations maintain inference efficiency
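The dual-matrix idea in the bullets above rests on a simple invariance: if two weight matrices interact directly in the computation graph, inserting a transformation into one and its inverse into the other leaves the overall function unchanged, while reshaping the weight distributions seen by the quantizer. A minimal sketch, using a random orthogonal matrix as a stand-in for the paper's learned transformation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_up = rng.normal(size=(d, d))     # first matrix of a directly interacting pair
W_down = rng.normal(size=(d, d))   # second matrix of the pair
x = rng.normal(size=d)

# Random orthogonal matrix, so its inverse is just its transpose.
# The paper learns such transformations; this is only an illustration.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

y_orig = (x @ W_up) @ W_down
y_trans = (x @ (W_up @ Q)) @ (Q.T @ W_down)   # quantize W_up @ Q and Q.T @ W_down instead

print(np.allclose(y_orig, y_trans))           # the composed computation is preserved
```

Because the transformation folds into the weights offline, inference runs on the transformed (and quantized) matrices with no extra runtime matrix multiplies, consistent with the reported sub-3% overhead.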