🤖 AI Summary
This work addresses the gradient distortion caused by conventional straight-through estimators (STE) in training fully binary neural networks, as well as the gradient blocking induced by activation binarization in existing progressive freezing strategies. The authors propose StoMPP, a novel method that employs inter-layer stochastic masking to enable layer-wise progressive binarization of both weights and activations, coupled with selective backpropagation applied only to unfrozen layers. This approach achieves, for the first time, end-to-end training of deep fully binary networks without relying on STE. Analysis of StoMPP's training dynamics reveals non-monotonic convergence and improved depth scaling under binary constraints, and the method yields substantial performance gains: on ResNet-50, it improves accuracy by 18.0, 13.5, and 3.8 percentage points on CIFAR-10, CIFAR-100, and ImageNet, respectively, with binary-weight networks attaining 91.2% and 69.5% accuracy on CIFAR-10 and CIFAR-100.
📝 Abstract
We investigate progressive freezing as an alternative to straight-through estimators (STE) for training binary networks from scratch. Under controlled training conditions, we find that while global progressive freezing works for binary-weight networks, it fails for fully binary neural networks due to activation-induced gradient blockades. We introduce StoMPP (Stochastic Masked Partial Progressive Binarization), which uses layerwise stochastic masking to progressively replace differentiable clipped weights/activations with hard binary step functions, while backpropagating only through the unfrozen (clipped) subset (i.e., no straight-through estimator). Under a matched minimal training recipe, StoMPP improves accuracy over a BinaryConnect-style STE baseline, with gains that increase with depth (e.g., for ResNet-50 BNN: +18.0 on CIFAR-10, +13.5 on CIFAR-100, and +3.8 on ImageNet; for ResNet-18: +3.1, +4.7, and +1.3). For binary-weight networks, StoMPP achieves 91.2% accuracy on CIFAR-10 and 69.5% on CIFAR-100 with ResNet-50. We analyze training dynamics under progressive freezing, revealing non-monotonic convergence and improved depth scaling under binarization constraints.
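The core mechanism described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration (not the authors' implementation): each layer carries a stochastic freeze flag; frozen layers use a hard sign function for weights and activations (and would be excluded from backpropagation), while unfrozen layers use a differentiable clipped surrogate. The freeze schedule `freeze_probs` (earlier layers binarize first, probabilities ramping over training) is an assumed example, not taken from the paper.

```python
import numpy as np

def hard_tanh(x):
    # Differentiable clipped surrogate used while a layer is unfrozen.
    return np.clip(x, -1.0, 1.0)

def sign(x):
    # Hard binary step used once a layer is frozen (no gradient flows here).
    return np.where(x >= 0, 1.0, -1.0)

def freeze_probs(num_layers, t, T):
    # Hypothetical schedule: shallower layers freeze first; every layer's
    # freeze probability ramps to 1 over the course of training.
    depth = np.arange(num_layers) / max(num_layers - 1, 1)
    return np.clip(2.0 * t / T - depth, 0.0, 1.0)

def sample_mask(num_layers, t, T, rng):
    # Stochastic per-layer mask: True = frozen (hard binary, excluded from
    # backprop), False = unfrozen (clipped surrogate, receives gradients).
    return rng.random(num_layers) < freeze_probs(num_layers, t, T)

def forward(x, weights, mask):
    # Forward pass choosing each layer's binarization by its mask entry;
    # in training, only the unfrozen layers would be backpropagated through,
    # so no straight-through estimator is needed.
    for w, frozen in zip(weights, mask):
        act = sign if frozen else hard_tanh
        x = act(x @ act(w))  # binarize (or clip) weights and activations alike
    return x
```

Early in training the sampled mask is mostly `False` (all layers differentiable); by the end it is all `True`, yielding a fully binary network.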