AI Summary
This work addresses vanishing or exploding activations and gradients in deep neural networks when scale-control mechanisms such as batch normalization are unavailable, as in physics-informed neural networks (PINNs), where the lack of such control often leads to unstable training. The authors propose StableGrad, an optimizer-level inter-layer gradient rescaling mechanism that adaptively corrects weight gradients after backpropagation without altering the forward architecture, adding normalization layers, or employing residual connections. This preserves the physical consistency of both the model output and its derivatives. StableGrad improves convergence and solution accuracy in deep PINNs and stabilizes training in ResNet/EfficientNet variants with batch normalization removed, offering a general-purpose, plug-and-play optimization strategy for scenarios where batch normalization is inapplicable.
Abstract
Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.
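The abstract does not state the rescaling rule itself, so the following is only a minimal sketch of where an optimizer-level correction could sit in a training step: after backpropagation has produced per-layer weight gradients and before the optimizer consumes them. It assumes a hypothetical rule that equalizes each layer's gradient RMS to the geometric mean across layers. The function names `layer_rms`, `rescale_layer_grads`, and `sgd_step` are illustrative and are not taken from StableGrad.

```python
import math


def layer_rms(grad):
    """Root-mean-square of one layer's gradient, given as a flat list."""
    return math.sqrt(sum(g * g for g in grad) / len(grad))


def rescale_layer_grads(grads, eps=1e-12):
    """Rescale every layer's gradient to a common RMS.

    Assumption for illustration only: the common target is the geometric
    mean of the per-layer RMS values. This is NOT the published StableGrad
    rule, which the abstract does not specify.
    """
    rms = [layer_rms(g) for g in grads]
    # Geometric mean is robust to RMS values spanning many orders of magnitude.
    target = math.exp(sum(math.log(r + eps) for r in rms) / len(rms))
    # Each layer keeps its direction; only its magnitude changes.
    return [[g * target / (r + eps) for g in layer]
            for layer, r in zip(grads, rms)]


def sgd_step(weights, grads, lr=0.1):
    """Plain SGD update applied to the (already rescaled) gradients."""
    return [[w - lr * g for w, g in zip(wl, gl)]
            for wl, gl in zip(weights, grads)]


# Toy per-layer gradients: vanishing, moderate, and exploding scales.
grads = [[1e-6, -2e-6], [0.5, -0.25], [300.0, 400.0]]
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

balanced = rescale_layer_grads(grads)        # after backprop, before update
new_weights = sgd_step(weights, balanced)    # forward model is never touched
```

Because the correction acts only on the gradient lists, the forward computation, and therefore the network output and its input derivatives, is unchanged by it. In this sketch only the size of each layer's update differs from plain SGD.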