SeeDNorm: Self-Rescaled Dynamic Normalization

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
RMSNorm, while widely adopted, discards input norm information in the forward pass and employs static scaling parameters (γ), limiting zero-shot generalization across diverse data distributions. To address this, we propose SeeDNorm—a lightweight, input-adaptive dynamic normalization method built upon RMSNorm. It introduces a data-dependent, differentiable scaling mechanism that explicitly preserves and leverages input norm statistics. Crucially, scaling parameters are end-to-end trainable via backpropagation, enhancing robustness to distribution shifts without compromising training stability or inference efficiency. With minimal parameter overhead, SeeDNorm maintains architectural simplicity and computational efficiency. Extensive experiments demonstrate consistent improvements over RMSNorm, LayerNorm, and DyT across large language models and vision tasks. Gains are particularly pronounced under zero-shot evaluation and across varying model scales, validating the efficacy of dynamic norm-aware normalization for generalization.
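To make the limitation described above concrete, here is a minimal RMSNorm in plain Python. This is an illustrative sketch, not the paper's code; the function names and the `eps` value are our own. It shows that the forward pass divides out the input's RMS, so the output carries no information about the input's original scale:

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Minimal RMSNorm sketch: normalize x to (roughly) unit RMS,
    then rescale dimension-wise with a static, input-independent gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

gamma = [1.0, 1.0]
y_small = rms_norm([3.0, 4.0], gamma)
y_large = rms_norm([30.0, 40.0], gamma)
# Inputs differing only in scale produce (nearly) identical outputs:
# the input norm information is discarded in the forward pass.
```

Because `gamma` is a fixed parameter learned during training, the same per-dimension rescaling is applied regardless of the input, which is the static-scaling limitation the summary points to.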

📝 Abstract
Normalization layers constitute an essential component of neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $γ$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $γ$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust gradients according to the input norm. We provide a detailed analysis of the training optimization of SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as in supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously common normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers such as DyT.
Problem

Research questions and friction points this paper is trying to address.

Dynamic scaling for input variability in normalization
Preserving input norm information in forward pass
Addressing instability in dynamic normalization training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically adjusts scaling coefficient based on input
Preserves input norm information during forward pass
Maintains gradient adjustment capability during backpropagation