🤖 AI Summary
RMSNorm, while widely adopted, discards input norm information in the forward pass and employs static scaling parameters (γ), limiting zero-shot generalization across diverse data distributions. To address this, we propose SeeDNorm—a lightweight, input-adaptive dynamic normalization method built upon RMSNorm. It introduces a data-dependent, differentiable scaling mechanism that explicitly preserves and leverages input norm statistics. Crucially, scaling parameters are end-to-end trainable via backpropagation, enhancing robustness to distribution shifts without compromising training stability or inference efficiency. With minimal parameter overhead, SeeDNorm maintains architectural simplicity and computational efficiency. Extensive experiments demonstrate consistent improvements over RMSNorm, LayerNorm, and DyT across large language models and vision tasks. Gains are particularly pronounced under zero-shot evaluation and across varying model scales, validating the efficacy of dynamic norm-aware normalization for generalization.
📝 Abstract
The normalization layer is an essential component of neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere and then rescales each dimension by a learnable scaling coefficient $γ$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $γ$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in the zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains RMSNorm's ability to dynamically adjust gradients according to the input norm. We provide a detailed analysis of SeeDNorm's training optimization and propose corresponding solutions to potential instability issues that may arise when applying it. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as in supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters, with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously common normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers such as DyT.
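To make the contrast concrete, the sketch below shows standard RMSNorm (normalize to the hypersphere, then rescale by a static per-dimension $γ$) alongside a hypothetical norm-conditioned variant in the spirit described above. The dynamic-scaling parameterization here (`w`, `b` modulating $γ$ by the input RMS) is an illustrative assumption, not the paper's actual formulation.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Standard RMSNorm: divide by the root-mean-square of x, which
    # projects x onto a unit hypersphere (up to sqrt(d)), discarding
    # the input norm; then rescale each dimension by the static,
    # learnable coefficient gamma.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def dynamic_norm_sketch(x, gamma, w, b, eps=1e-6):
    # Hypothetical norm-conditioned normalization in the spirit of the
    # abstract: the effective scaling coefficient depends on the input's
    # RMS, so the norm information that plain RMSNorm discards re-enters
    # the output. `w` and `b` are assumed per-dimension parameters; the
    # paper's exact parameterization may differ.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    dynamic_gamma = gamma * (1.0 + w * rms + b)  # data-dependent rescale
    return dynamic_gamma * (x / rms)

x = np.array([[3.0, 4.0]])
out_static = rms_norm(x, gamma=np.ones(2))
out_dynamic = dynamic_norm_sketch(x, gamma=np.ones(2),
                                  w=0.1 * np.ones(2), b=np.zeros(2))
print(out_static, out_dynamic)
```

With `gamma = 1`, the static output always has unit RMS regardless of the input's scale, whereas the dynamic variant's output magnitude varies with the input norm; since all operations are differentiable, such parameters could be trained end-to-end by backpropagation as the abstract describes.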