Enhancing Post-Training Quantization via Future Activation Awareness

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the sensitivity of post-training quantization (PTQ) to calibration data bias, which often leads to accumulated quantization errors and unstable performance. To mitigate this issue, the authors propose Future-Aware Quantization (FAQ), a novel approach that leverages activation information from future layers to guide the quantization of the current layer. By introducing a windowed preview mechanism, FAQ softly aggregates activations across multiple subsequent layers, thereby avoiding over-reliance on any single layer. Notably, the method operates without backpropagation, data reconstruction, or fine-tuning, and incurs negligible computational overhead. Experimental results demonstrate that FAQ significantly enhances both the robustness and accuracy of quantized models, making it highly suitable for efficient deployment on edge devices.

📝 Abstract
Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.
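The abstract describes two moving parts: a window-wise preview that softly aggregates activation statistics from several future layers into a per-channel importance score, and a quantizer that uses that score to better preserve important weights. The paper's exact formulation is not given here, so the sketch below is a hypothetical, stdlib-only illustration: it uses a geometrically decaying window over mean absolute activations, and an AWQ-style channel-rescaling trick (the `boost`, `decay`, and `protect_frac` parameters are assumptions, not the authors' settings).

```python
def window_preview_importance(future_acts, decay=0.5):
    """Softly aggregate per-channel activation magnitudes over a window
    of future layers (hypothetical stand-in for the paper's preview).

    future_acts: list of activation matrices, one per future layer;
    each matrix is rows = tokens, cols = input channels.
    """
    n_ch = len(future_acts[0][0])
    # geometrically decaying weights: nearer future layers count more
    w = [decay ** i for i in range(len(future_acts))]
    z = sum(w)
    imp = [0.0] * n_ch
    for wi, acts in zip(w, future_acts):
        for c in range(n_ch):
            mean_abs = sum(abs(row[c]) for row in acts) / len(acts)
            imp[c] += (wi / z) * mean_abs
    return imp

def quantize_row(row, importance, n_bits=4, boost=2.0, protect_frac=0.25):
    """Symmetric round-to-nearest quantization of one weight row,
    rescaling the most important channels before rounding so their
    rounding error shrinks (an AWQ-style trick, assumed here)."""
    k = max(1, int(protect_frac * len(row)))
    top = set(sorted(range(len(row)), key=lambda c: importance[c])[-k:])
    scaled = [v * (boost if c in top else 1.0) for c, v in enumerate(row)]
    qmax = 2 ** (n_bits - 1) - 1
    s = max(abs(v) for v in scaled) / qmax or 1.0
    deq = [max(-qmax - 1, min(qmax, round(v / s))) * s for v in scaled]
    # undo the boost after dequantization
    return [v / (boost if c in top else 1.0) for c, v in enumerate(deq)]
```

With a toy window of two future layers where channel 0 dominates, the aggregated importance flags channel 0 as salient, and the quantizer preserves it more faithfully than the low-importance channel. No backward pass or reconstruction is involved, consistent with the abstract's claim of negligible overhead.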
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Quantization Bias
Error Accumulation
Calibration Data Bias
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Future-Aware Quantization
Post-Training Quantization
Activation Awareness
Window-wise Preview
Calibration Robustness
Zheqi Lv
Zhejiang University, Hangzhou, China
Zhenxuan Fan
Zhejiang University, Hangzhou, China
Qi Tian
Tencent
generative model, reinforcement learning
Wenqiao Zhang
Zhejiang University, Hangzhou, China
Yueting Zhuang
Zhejiang University, Hangzhou, China