🤖 AI Summary
Quantization-aware training (QAT) suffers from optimization difficulties due to the piecewise-constant nature of quantizers, which causes gradients to vanish almost everywhere and renders the loss non-differentiable at quantization thresholds. To address this, we propose a loss-smoothing technique based on unbiased stochastic rounding: we construct an expectation-based surrogate loss that is differentiable and serves as a statistically consistent approximation of the original quantized loss. We prove that this surrogate preserves the convergence guarantees of standard optimizers to local minima and shares its global optima with the original problem. Our method brings Nesterov-smoothing principles to low-precision optimization, introducing controllable stochastic noise during backpropagation to enable end-to-end differentiability. Experiments on synthetic data and on language models (150M/300M parameters) demonstrate significant improvements over conventional QAT: enhanced training stability, faster convergence, and higher final accuracy.
📄 Abstract
Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piecewise constant, yielding zero gradients everywhere except at quantization thresholds, where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like the Straight-Through Estimator (STE) and do not provide any convergence guarantees. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous one. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the smoothed loss surface. Moreover, when the noise is derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
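The unbiased randomized rounding underlying this idea can be illustrated with a short sketch. The function names and the Monte-Carlo estimator below are my own illustrative choices, not the paper's implementation: a weight is rounded up or down to the quantization grid with probability proportional to its fractional position, so the rounding is exactly unbiased, and averaging the loss over many rounding draws approximates the smoothed surrogate loss.

```python
import numpy as np

def stochastic_round(w, delta=0.1, rng=None):
    """Unbiased stochastic rounding of w onto a grid with spacing delta.

    Rounds up with probability equal to the fractional position between
    grid points, so E[stochastic_round(w)] == w exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    lower = np.floor(w / delta) * delta
    frac = (w - lower) / delta                  # fractional position in [0, 1)
    round_up = rng.random(np.shape(w)) < frac   # Bernoulli(frac) per element
    return lower + delta * round_up

def smoothed_loss(loss_fn, w, delta=0.1, n_samples=1000, rng=None):
    """Monte-Carlo estimate of the surrogate E[loss(Q(w))], where Q is
    stochastic rounding. The expectation is a smooth function of w even
    though each individual rounded loss is piecewise constant."""
    rng = np.random.default_rng(0) if rng is None else rng
    draws = [loss_fn(stochastic_round(w, delta, rng)) for _ in range(n_samples)]
    return float(np.mean(draws))
```

Because each coordinate rounds up with probability `frac`, the expected rounded value is `lower + delta * frac = w`, which is the unbiasedness property the smoothing argument relies on.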