🤖 AI Summary
Deep neural networks are heterogeneous across layers: residual blocks and multi-head attention modules, for example, differ markedly in dimensionality, activation patterns, and representation characteristics. In distributed variational inequality (VI) optimization, this inter-layer heterogeneity leads to excessive communication overhead and slow convergence.
Method: This paper is the first to introduce layer-aware quantization into the VI optimization framework, proposing a layer-adaptive quantization mechanism and the Quantized Optimistic Dual Averaging (QODA) algorithm.
Contribution/Results: We derive tight bounds on quantization variance and minimum code length, and design an adaptive step-size strategy that ensures an optimal $O(1/T)$ convergence rate for monotone VIs. Evaluated on training Wasserstein GANs on a cluster of 12+ GPUs, our method achieves up to a 150% end-to-end speedup, substantially outperforming existing quantization-based and distributed VI approaches.
📝 Abstract
Modern deep neural networks exhibit heterogeneity across their many layers of various types, such as residual and multi-head attention layers, owing to their varying structure (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds that adapts to these heterogeneities over the course of training. We then apply this new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time for training Wasserstein GAN on $12+$ GPUs.
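To make the layer-wise idea concrete, below is a minimal sketch of an unbiased stochastic uniform quantizer whose resolution is chosen per layer. This is an illustration only, not the paper's actual quantizer: the per-layer bit-allocation rule (more levels for higher-variance layers), the layer names, and all thresholds are hypothetical.

```python
import numpy as np

def quantize_layer(grad, num_levels):
    """Quantize a gradient tensor to `num_levels` magnitude levels,
    scaled by the layer's max absolute value, with stochastic rounding
    so the quantizer is unbiased in expectation."""
    scale = np.max(np.abs(grad))
    if scale == 0:
        return np.zeros_like(grad), scale
    normalized = np.abs(grad) / scale * (num_levels - 1)
    lower = np.floor(normalized)
    # Round up with probability equal to the fractional part (unbiased).
    prob = normalized - lower
    levels = lower + (np.random.rand(*grad.shape) < prob)
    return np.sign(grad) * levels / (num_levels - 1), scale

def dequantize(q, scale):
    # Recover an (unbiased) estimate of the original gradient.
    return q * scale

if __name__ == "__main__":
    # Hypothetical layer-adaptive rule: give higher-variance layers more levels.
    rng = np.random.default_rng(0)
    grads = {"residual_block": rng.normal(0, 1.0, 64),
             "attention_head": rng.normal(0, 5.0, 64)}
    for name, g in grads.items():
        levels = 4 if np.std(g) < 2.0 else 16  # crude per-layer allocation
        q, s = quantize_layer(g, levels)
        err = np.linalg.norm(dequantize(q, s) - g) / np.linalg.norm(g)
        print(f"{name}: levels={levels}, relative error={err:.3f}")
```

Fewer levels per layer means shorter codes and less communication; the variance and code-length bounds in the paper quantify this trade-off rigorously, which this toy sketch does not attempt.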