Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how the placement of normalization layers (e.g., BatchNorm, LayerNorm) within a network's hidden layers affects the predictive distribution at initialization, in particular the initial bias across class predictions. Method: combining random matrix theory with Gaussian process approximations, the authors establish a theoretical link between normalization position and initialization-time prediction statistics, introducing the "initialization bias degree" as a quantitative metric. Results: pre-normalization systematically attenuates initial class bias, driving the output logits toward a uniform, neutral distribution at initialization. This markedly improves training stability and convergence consistency across architectures, including fully connected networks, CNNs, and Transformers, without architectural or hyperparameter modification. Empirical validation on multiple benchmarks confirms that normalization placement offers interpretable control over early optimization dynamics.

📝 Abstract
Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks, widely adopted to improve training stability and generalization. While their practical effectiveness is well documented, a detailed theoretical understanding of how normalization affects model behavior, starting from initialization, remains an important open question. In this work, we investigate how both the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. In particular, we study how these choices shape the distribution of class predictions at initialization, which can range from unbiased (Neutral) to highly concentrated (Prejudiced) toward a subset of classes. Our analysis shows that normalization placement induces systematic differences in the initial prediction behavior of neural networks, which in turn shape the dynamics of learning. By linking architectural choices to prediction statistics at initialization, our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
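The Prejudiced-versus-Neutral contrast described above can be illustrated with a small, self-contained NumPy sketch (a toy ReLU MLP, not the authors' code; the per-sample standardization below is a LayerNorm-style stand-in for pre-activation normalization, and names like `mlp_logits` are illustrative):

```python
import numpy as np

def mlp_logits(x, weights, pre_norm):
    """Forward pass through a ReLU MLP, optionally standardizing
    each sample's pre-activations before the nonlinearity."""
    h = x
    for W in weights[:-1]:
        z = h @ W
        if pre_norm:
            # per-sample standardization across neurons (LayerNorm-style)
            z = (z - z.mean(axis=1, keepdims=True)) / (z.std(axis=1, keepdims=True) + 1e-6)
        h = np.maximum(z, 0.0)  # ReLU
    return h @ weights[-1]

rng = np.random.default_rng(0)
d_in, width, depth, n_classes, n_samples = 32, 256, 10, 10, 2000
sizes = [d_in] + [width] * (depth - 1) + [n_classes]
# He-style initialization for each layer
weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(n_samples, d_in))

for pre_norm in (False, True):
    preds = mlp_logits(X, weights, pre_norm).argmax(axis=1)
    # fraction of inputs assigned to the single most-predicted class:
    # a crude proxy for how "Prejudiced" the initial predictions are
    top_share = np.bincount(preds, minlength=n_classes).max() / n_samples
    print(f"pre_norm={pre_norm}: top-class share = {top_share:.2f}")
```

In runs of this sketch, the unnormalized deep ReLU network typically concentrates its argmax predictions on a small subset of classes, while the pre-normalized variant spreads them more evenly, mirroring the shift toward a neutral initial prediction distribution that the abstract describes.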
Problem

Research questions and friction points this paper is trying to address.

How normalization placement affects initial network predictions
Impact of normalization on class prediction distribution at initialization
Linking architectural choices to early training behavior dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Normalization placement affects initial predictions
Links architectural choices to prediction statistics
Guides interpretable neural network design
Emanuele Francazi
PhD student, École Polytechnique Fédérale de Lausanne
statistical mechanics · machine learning · disordered systems
Francesco Pinto
Research Scientist, Google DeepMind
Agentic AI Safety and Security
Aurélien Lucchi
Department of Mathematics and Computer Science, University of Basel, 4051 Basel, Switzerland
M. Baity-Jesi
SIAM Department, Eawag (ETH), 8600 Dübendorf, Switzerland