🤖 AI Summary
RMSNorm, while widely adopted, discards input norm information in the forward pass and employs static scaling parameters (γ), limiting zero-shot generalization across diverse data distributions. To address this, we propose SeeDNorm—a lightweight, input-adaptive dynamic normalization method built upon RMSNorm. It introduces a data-dependent, differentiable scaling mechanism that explicitly preserves and leverages input norm statistics. Crucially, scaling parameters are end-to-end trainable via backpropagation, enhancing robustness to distribution shifts without compromising training stability or inference efficiency. With minimal parameter overhead, SeeDNorm maintains architectural simplicity and computational efficiency. Extensive experiments demonstrate consistent improvements over RMSNorm, LayerNorm, and DyT across large language models and vision tasks. Gains are particularly pronounced under zero-shot evaluation and across varying model scales, validating the efficacy of dynamic norm-aware normalization for generalization.
📝 Abstract
The normalization layer is an essential component of neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere and then rescales each dimension by a learnable scaling coefficient $γ$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $γ$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in the zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains RMSNorm's ability to dynamically adjust gradients according to the input norm. We provide a detailed analysis of SeeDNorm's training optimization and propose corresponding solutions to potential instability issues that may arise when applying it. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as in supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters, with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously common normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers such as DyT.
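To make the contrast concrete, the sketch below shows standard RMSNorm (normalize to the hypersphere, then rescale by a static per-dimension $γ$) alongside a hypothetical norm-conditioned variant in the spirit described above. The dynamic-scaling parameterization here (`w`, `b` modulating $γ$ by the input RMS) is an illustrative assumption, not the paper's actual formulation.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Standard RMSNorm: divide by the root-mean-square of x, which
    # projects x onto a unit hypersphere (up to sqrt(d)), discarding
    # the input norm; then rescale each dimension by the static,
    # learnable coefficient gamma.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def dynamic_norm_sketch(x, gamma, w, b, eps=1e-6):
    # Hypothetical norm-conditioned normalization in the spirit of the
    # abstract: the effective scaling coefficient depends on the input's
    # RMS, so the norm information that plain RMSNorm discards re-enters
    # the output. `w` and `b` are assumed per-dimension parameters; the
    # paper's exact parameterization may differ.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    dynamic_gamma = gamma * (1.0 + w * rms + b)  # data-dependent rescale
    return dynamic_gamma * (x / rms)

x = np.array([[3.0, 4.0]])
out_static = rms_norm(x, gamma=np.ones(2))
out_dynamic = dynamic_norm_sketch(x, gamma=np.ones(2),
                                  w=0.1 * np.ones(2), b=np.zeros(2))
print(out_static, out_dynamic)
```

With `gamma = 1`, the static output always has unit RMS regardless of the input's scale, whereas the dynamic variant's output magnitude varies with the input norm; since all operations are differentiable, such parameters could be trained end-to-end by backpropagation as the abstract describes.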