🤖 AI Summary
ReLU suffers from the "dying neuron" problem, in which units become permanently inactive, leading to vanishing gradients and degraded performance. To address this, we propose SUGAR: a backward-compatible activation method that retains standard ReLU in the forward pass, preserving sparsity and computational efficiency, while replacing the conventional zero gradient in the backward pass with a learnable, smooth surrogate gradient that dynamically revives dead neurons. Crucially, SUGAR is the first method to formalize the surrogate gradient as a ReLU-specific regularizer, systematically mitigating neuron death without altering forward propagation. The method is plug-and-play and compatible with mainstream architectures including VGG-16, ResNet-18, ConvNeXt, and Swin Transformer. Experiments across multiple vision tasks demonstrate that SUGAR consistently outperforms advanced activations such as GELU and SELU, yielding improved generalization, enhanced activation sparsity, and effective reactivation of previously dead ReLU units.
📝 Abstract
Modeling sophisticated activation functions within deep learning architectures has evolved into a distinct research direction. Functions such as GELU, SELU, and SiLU offer smooth gradients and improved convergence properties, making them popular choices in state-of-the-art models. Despite this trend, the classical ReLU remains appealing due to its simplicity, inherent sparsity, and other advantageous topological characteristics. However, ReLU units are prone to becoming irreversibly inactive - a phenomenon known as the dying ReLU problem - which limits their overall effectiveness. In this work, we introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures. SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate that avoids zeroing out gradients. We demonstrate that SUGAR, when paired with a well-chosen surrogate function, substantially enhances generalization performance on convolutional architectures such as VGG-16 and ResNet-18, providing sparser activations while effectively resurrecting dead ReLUs. Moreover, we show that even in modern architectures like ConvNeXt and Swin Transformer - which typically employ GELU - substituting it with SUGAR yields competitive and even slightly superior performance. These findings challenge the prevailing notion that advanced activation functions are necessary for optimal performance. Instead, they suggest that the conventional ReLU, particularly with appropriate gradient handling, can serve as a strong, versatile revived classic across a broad range of deep learning vision models.
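To make the forward/backward decoupling concrete, here is a minimal NumPy sketch of the idea: the forward pass is standard ReLU, while the backward pass multiplies the upstream gradient by a smooth, everywhere-nonzero surrogate derivative instead of ReLU's step function. The sigmoid-shaped surrogate and the `beta` parameter below are illustrative assumptions, not the paper's specific choice of surrogate function.

```python
import numpy as np

def sugar_forward(x):
    # Forward pass: standard ReLU, so sparsity and the usual
    # forward computation are fully preserved.
    return np.maximum(x, 0.0)

def surrogate_derivative(x, beta=1.0):
    # Hypothetical smooth surrogate for ReLU's derivative:
    # a sigmoid, which is nonzero even for x < 0, so units
    # that ReLU would "kill" still receive a gradient signal.
    return 1.0 / (1.0 + np.exp(-beta * x))

def sugar_backward(x, upstream_grad, beta=1.0):
    # Backward pass: replace ReLU's step-function derivative
    # (1 for x > 0, 0 otherwise) with the smooth surrogate.
    return upstream_grad * surrogate_derivative(x, beta)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = sugar_forward(x)                       # identical to ReLU(x)
g = sugar_backward(x, np.ones_like(x))    # nonzero even where x < 0
```

In an autodiff framework this would be implemented as a custom op (e.g., a `torch.autograd.Function` in PyTorch) whose `forward` returns `relu(x)` and whose `backward` uses the surrogate derivative, which is what makes the method plug-and-play: the forward graph of the model is unchanged.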