Bug or Feature$^2$: Weight Drift, Activation Sparsity, and Spikes

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies a negative weight drift that arises when standard losses such as mean squared error and cross-entropy are combined with positively biased activation functions like ReLU. During early training, the drift pushes downstream weights toward negative values, which sharply increases activation sparsity. Gradient analysis and experiments across MLPs, ResNets, Vision Transformers, GPT-nano, and MP-SENet indicate that the drift stems from optimization rather than data and persists across architectures. Combined with ReLU, it produces activation sparsity of up to 90% in GPT-nano, and accuracy drops sharply once sparsity exceeds roughly 70%. Among the squared activations studied, ReLU² offers a good sparsity-accuracy trade-off in GPT-nano but amplifies activation spikes in intermediate transformer layers. Clipping alleviates the spiking: clipped ReLU² outperforms its unclipped version, and GELU² yields the lowest validation loss on GPT-nano.

📝 Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90\% in GPT-nano. We characterize the sparsity--accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above $\sim$70\% activation sparsity. While ReLU$^2$ achieves a good sparsity--accuracy ratio in GPT-nano, it pathologically amplifies the activation spikes we identify in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU$^2$ outperforms its unclipped version, and GELU$^2$ achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
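
To make the quantities in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that ReLU$^2$ and GELU$^2$ denote the elementwise square of the ReLU and GELU outputs, that "clipped" means capping the squared output at a constant, and that activation sparsity is the fraction of exactly-zero activations. The names `clipped_relu2`, `clip`, `activation_sparsity`, and `sparsity_under_shift` are hypothetical, and the negative mean shift is only a stand-in for the effect of negative weight drift on pre-activations.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu2(x: float) -> float:
    # Assumption: ReLU^2 is the square of the ReLU output.
    return relu(x) ** 2


def gelu2(x: float) -> float:
    # Assumption: GELU^2 is the square of the GELU output.
    return gelu(x) ** 2


def clipped_relu2(x: float, clip: float = 4.0) -> float:
    # Assumption: clipping caps the squared output at a constant.
    # The value 4.0 is illustrative, not taken from the paper.
    return min(relu2(x), clip)


def activation_sparsity(activations) -> float:
    # Fraction of activations that are exactly zero.
    values = list(activations)
    return sum(1 for a in values if a == 0.0) / len(values)


def sparsity_under_shift(mean: float, n: int = 20000, seed: int = 0) -> float:
    # Gaussian pre-activations with a given mean; a more negative mean
    # mimics pre-activations pushed down by negative weight drift.
    rng = random.Random(seed)
    return activation_sparsity(relu(rng.gauss(mean, 1.0)) for _ in range(n))


if __name__ == "__main__":
    for mean in (0.0, -0.5, -1.0, -1.5):
        print(f"mean={mean:+.1f}  ReLU sparsity={sparsity_under_shift(mean):.3f}")
    # Squaring amplifies large inputs; clipping bounds them.
    print(relu2(10.0), clipped_relu2(10.0))
```

Under these assumptions, ReLU sparsity sits near one half for zero-mean pre-activations and grows as the mean shifts negative, while squaring turns a large positive pre-activation into a much larger output unless the cap is applied.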

Problem

Research questions and friction points this paper is trying to address.

weight drift

activation sparsity

activation spikes

neural network training

asymmetric activation functions

Innovation

Methods, ideas, or system contributions that make the work stand out.

weight drift

activation sparsity

ReLU squared