🤖 AI Summary
Quantization-aware training (QAT) suffers from optimization difficulties due to the piecewise-constant nature of quantizers, which causes gradients to vanish almost everywhere and renders the loss non-differentiable at quantization thresholds. To address this, we propose a loss-smoothing technique based on unbiased stochastic rounding: we construct an expectation-based surrogate loss that is differentiable and serves as a statistically consistent approximation of the original quantized loss. We prove that this surrogate preserves the convergence guarantees of standard optimizers to local minima and shares its global optima with the original problem. Our method brings Nesterov-smoothing principles to low-precision optimization, introducing controllable stochastic noise during backpropagation to enable end-to-end differentiability. Experiments on synthetic data and on language models (150M/300M parameters) demonstrate significant improvements over conventional QAT: enhanced training stability, faster convergence, and higher final accuracy.
📄 Abstract
Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piecewise constant, yielding zero gradients everywhere except at quantization thresholds, where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like the Straight-Through Estimator (STE) and do not provide any convergence guarantees. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous one. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the smoothed loss surface. Moreover, when the noise is derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
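The unbiased randomized rounding underlying this idea can be illustrated with a short sketch. The function names and the Monte-Carlo estimator below are my own illustrative choices, not the paper's implementation: a weight is rounded up or down to the quantization grid with probability proportional to its fractional position, so the rounding is exactly unbiased, and averaging the loss over many rounding draws approximates the smoothed surrogate loss.

```python
import numpy as np

def stochastic_round(w, delta=0.1, rng=None):
    """Unbiased stochastic rounding of w onto a grid with spacing delta.

    Rounds up with probability equal to the fractional position between
    grid points, so E[stochastic_round(w)] == w exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    lower = np.floor(w / delta) * delta
    frac = (w - lower) / delta                  # fractional position in [0, 1)
    round_up = rng.random(np.shape(w)) < frac   # Bernoulli(frac) per element
    return lower + delta * round_up

def smoothed_loss(loss_fn, w, delta=0.1, n_samples=1000, rng=None):
    """Monte-Carlo estimate of the surrogate E[loss(Q(w))], where Q is
    stochastic rounding. The expectation is a smooth function of w even
    though each individual rounded loss is piecewise constant."""
    rng = np.random.default_rng(0) if rng is None else rng
    draws = [loss_fn(stochastic_round(w, delta, rng)) for _ in range(n_samples)]
    return float(np.mean(draws))
```

Because each coordinate rounds up with probability `frac`, the expected rounded value is `lower + delta * frac = w`, which is the unbiasedness property the smoothing argument relies on.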