LOTION: Smoothing the Optimization Landscape for Quantized Training

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Quantization-aware training (QAT) suffers from optimization difficulties due to the piecewise-constant nature of quantizers, which causes gradients to vanish almost everywhere and renders the loss non-differentiable at quantization thresholds. To address this, we propose a loss-smoothing technique based on unbiased stochastic rounding: we construct an expectation-based surrogate loss that is differentiable and serves as a statistically consistent approximation of the original quantized loss. We prove that this surrogate preserves the convergence guarantees of standard optimizers to local minima and shares its global optima with the original problem. Our method integrates Nesterov smoothing principles into low-precision optimization, introducing controllable stochastic noise during backpropagation to enable end-to-end differentiability. Experiments on synthetic data and large language models (150M/300M parameters) demonstrate significant improvements over conventional QAT: enhanced training stability, faster convergence, and higher final accuracy.
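The unbiased randomized rounding at the core of the summary above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation; `stochastic_round` and the grid `step` are hypothetical names:

```python
import numpy as np

def stochastic_round(x, step=1.0, rng=None):
    """Unbiased randomized rounding: snap x to a multiple of `step`,
    choosing the upper neighbor with probability equal to the
    fractional distance, so that E[stochastic_round(x)] == x."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=float) / step
    lower = np.floor(scaled)
    frac = scaled - lower                    # P(round up)
    up = rng.random(scaled.shape) < frac
    return (lower + up) * step

# Averaging many draws recovers the input (unbiasedness):
rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
print(samples.mean())  # close to 0.3
```

Unbiasedness is what makes the expectation of the loss under this rounding a consistent surrogate: the noise averages out rather than shifting the optimum.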

๐Ÿ“ Abstract
Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piecewise constant, yielding zero gradients everywhere except at quantization thresholds, where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight-Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION (Low-precision Optimization via sTochastic-noIse smOothiNg), a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
Problem

Research questions and friction points this paper is trying to address.

Addressing zero-gradient issue in quantized neural network training
Providing convergence guarantees for low-precision optimization methods
Preserving global minima while smoothing quantized loss surfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximates quantized loss with continuous surface
Uses stochastic noise smoothing for optimization
Preserves global minima of original quantized loss
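For a single weight quantized to {0, 1}, the contrast between the raw quantized loss and its expectation under stochastic rounding can be worked out in closed form. This is a hedged one-dimensional toy model; `hard_loss`, `smoothed_loss`, and `TARGET` are illustrative names, not from the paper:

```python
import numpy as np

TARGET = 0.3  # toy regression target

def hard_loss(w):
    # Squared error at the deterministically rounded weight:
    # piecewise constant in w, so its gradient is zero almost everywhere.
    return (np.round(w) - TARGET) ** 2

def smoothed_loss(w):
    # Expectation under unbiased stochastic rounding of w in [0, 1]:
    # Q(w) = 1 with probability w, else 0, so
    # E[(Q(w) - t)^2] = w * (1 - t)^2 + (1 - w) * t^2,
    # which is differentiable in w with constant slope 1 - 2t.
    return w * (1 - TARGET) ** 2 + (1 - w) * TARGET ** 2

print(hard_loss(0.2), hard_loss(0.4))          # identical: no gradient signal
print(smoothed_loss(0.2), smoothed_loss(0.4))  # differ: usable gradient
```

Note that at the quantization points w = 0 and w = 1 the smoothed loss coincides with the hard loss, which is the one-dimensional version of the global-minima-preservation claim.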
🔎 Similar Papers
No similar papers found.
Mujin Kwun
Harvard University
Depen Morwani
Harvard University
Machine Learning Theory; Theoretical Deep Learning
Chloe Huangyuan Su
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Computer Science, Harvard University
Stephanie Gil
Assistant Professor, Harvard University
Networked robotics; multi-robot control
Nikhil Anand
Sr. Research Scientist, Kempner Institute @ Harvard
Machine Learning; Theoretical Physics
Sham Kakade
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Computer Science, Harvard University