🤖 AI Summary
This work addresses the challenges of parameter discretization and accuracy loss in quantization-aware training (QAT) of large-scale models. To this end, we propose convex piecewise-affine regularization (PAR), which implicitly guides model parameters toward discrete quantization values through automatic clustering, and we design the aggregate proximal stochastic gradient method (AProx) to minimize the regularized loss with guaranteed convergence. We establish the first theoretical connection between the straight-through estimator (STE) and PAR, showing that STE emerges as the asymptotic form of PAR regularization, and we provide a rigorous convergence analysis for QAT under the proposed framework. Extensive experiments on CNN- and Transformer-based vision tasks show that the method achieves accuracy competitive with state-of-the-art QAT baselines, supporting both the theoretical soundness and the practical generality of the approach.
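The training setup described above can be sketched as a standard regularized stochastic optimization problem (the symbols \(\lambda\) and \(R_{\mathrm{PAR}}\) below are illustrative notation, not taken from the paper):

```latex
\min_{w \in \mathbb{R}^d} \;
\mathbb{E}_{\xi}\bigl[\, f(w;\xi) \,\bigr]
\;+\; \lambda \, R_{\mathrm{PAR}}(w)
```

Here \(f(w;\xi)\) is the stochastic training loss and \(R_{\mathrm{PAR}}\) is the convex piecewise-affine penalty that pulls weights toward the quantization values. In a generic proximal stochastic gradient scheme, each iteration takes a gradient step on \(f\) followed by the proximal mapping of \(\lambda R_{\mathrm{PAR}}\); AProx is the paper's aggregated variant of this template.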
📝 Abstract
We develop a principled method for quantization-aware training (QAT) of large-scale machine learning models. Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce the model parameters to cluster toward discrete values. We minimize the PAR-regularized loss function using an aggregate proximal stochastic gradient method (AProx) and prove that it has last-iterate convergence. Our approach, PAR quantization (PARQ), provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as its asymptotic form. We conduct experiments to demonstrate that PARQ obtains competitive performance on convolution- and transformer-based vision tasks.
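To make the two ingredients concrete, here is a minimal NumPy sketch of (a) round-to-nearest quantization as used in STE-style QAT, and (b) a simplified prox-like step that moves each weight a bounded distance toward its nearest quantization value. Both functions and their names are illustrative; the second is a stand-in for intuition only, not the paper's AProx update or its PAR proximal mapping.

```python
import numpy as np

def quantize_nearest(w, grid):
    """Map each weight to its nearest value in the quantization grid."""
    grid = np.asarray(grid, dtype=float)
    idx = np.abs(w[..., None] - grid).argmin(axis=-1)
    return grid[idx]

# In autograd frameworks, the STE forward/backward pair is commonly written as
#   w_q = w + stop_gradient(quantize_nearest(w, grid) - w)
# so the forward pass uses the quantized weights while the backward pass
# treats the rounding as the identity.

def prox_like_step(w, grid, lam):
    """Illustrative proximal-style update (hypothetical, not AProx):
    move each weight at most `lam` toward its nearest grid value,
    so repeated steps gradually cluster weights at quantization points."""
    target = quantize_nearest(w, grid)
    step = np.clip(target - w, -lam, lam)
    return w + step
```

For example, with the grid `[0.0, 1.0]` and `lam=0.1`, a weight at `0.3` moves to `0.2` in one step and reaches `0.0` after three; as the regularization strength grows, the update approaches hard rounding, which is the asymptotic STE connection the abstract refers to.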