Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Layer normalization (LN) is widely assumed necessary for stable inference in large language models (LLMs), yet its actual necessity at inference time, and its impact on mechanistic interpretability, remain underexplored. Method: We systematically remove LN from GPT-2 variants, quantify the resulting performance degradation, and recover the original performance via parameter rescaling and lightweight fine-tuning. Contribution/Results: We demonstrate that LN is not essential for GPT-2 inference: its complete removal incurs only a +0.03 increase in validation loss, and performance is fully restored without LN through lightweight compensation. This is the first systematic empirical demonstration that LN is not strictly necessary for autoregressive language modeling, and it enables normalization-free Transformer architectures. We release fully LN-free GPT-2 models, which substantially simplify component-level causal analysis (enabling exact direct logit attribution) and show that "confidence neurons" deactivate in the absence of LN, yielding cleaner, more reliable probes for interpretability research.
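The core idea behind LN removal can be sketched in a few lines. LayerNorm divides each residual-stream vector by its own standard deviation, a per-input nonlinearity; replacing that with a single fixed scale (e.g. an average estimated over a calibration set) makes the operation linear, after which fine-tuning can absorb the small remaining gap. The sketch below is illustrative only, not the paper's exact procedure, and all names and the calibration data are hypothetical:

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm (learned affine omitted for brevity): center
    the vector and divide by its own per-input standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ln_free(x, avg_std):
    """LN-free replacement: divide by one fixed, input-independent
    scale instead of the per-input std, making the map linear in x."""
    mu = sum(x) / len(x)
    return [(v - mu) / avg_std for v in x]

# Hypothetical calibration set: estimate the average std once,
# then use it as the fixed scale everywhere.
calib = [[1.0, -1.0, 2.0, -2.0], [0.5, -0.5, 1.5, -1.5]]
stds = []
for x in calib:
    mu = sum(x) / len(x)
    stds.append(math.sqrt(sum((v - mu) ** 2 for v in x) / len(x)))
avg_std = sum(stds) / len(stds)
```

The key contrast: `layer_norm` is scale-invariant (doubling the input leaves the output unchanged), while `ln_free` is homogeneous (doubling the input doubles the output), which is exactly the linearity that downstream interpretability methods exploit.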

📝 Abstract
Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here, we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus, LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free models. Direct logit attribution now gives the exact direct effect of individual components, while the accuracy of attribution patching does not significantly improve. We also confirm that GPT-2's "confidence neurons" are inactive in the LN-free models. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogs of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.
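The abstract's claim that "direct logit attribution now gives the exact direct effect of individual components" follows from linearity: without a final LayerNorm between the residual stream and the unembedding, per-component logit contributions sum exactly to the total logits. A toy illustration with made-up numbers (not the paper's code; `W_U` and the component outputs are hypothetical):

```python
# With no final LN, logits = W_U @ residual is linear, so attributing
# logits to the components that wrote into the residual stream is exact.
d_model, n_vocab = 3, 2
W_U = [[0.5, -1.0, 2.0],    # hypothetical unembedding matrix,
       [1.5, 0.25, -0.5]]   # one row per vocabulary token

# Outputs that three hypothetical components (e.g. the embedding, an
# attention head, an MLP) add into the residual stream.
components = [[0.1, 0.2, -0.3], [0.4, -0.1, 0.0], [-0.2, 0.3, 0.5]]
residual = [sum(c[i] for c in components) for i in range(d_model)]

def logits(vec):
    # Purely linear unembedding: no LN nonlinearity in the way.
    return [sum(w * v for w, v in zip(row, vec)) for row in W_U]

total = logits(residual)                      # logits of the full model
attributions = [logits(c) for c in components]  # per-component effects
recombined = [sum(a[t] for a in attributions) for t in range(n_vocab)]
```

With a final LN in place, the per-input rescaling breaks this decomposition, which is why standard direct logit attribution is only approximate on ordinary GPT-2.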
Problem

Research questions and friction points this paper is trying to address.

Understanding LayerNorm's role in transformer inference
Removing LayerNorm to simplify mechanistic interpretability
Scaling LayerNorm removal to larger language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Remove LayerNorm layers from GPT-2 models
Fine-tune models with data requirements that grow sublinearly in model size
Enable precise interpretability without LayerNorm
Luca Baroni
Charles University
Galvin Khara
Imperial College London
Joachim Schaeffer
Pivotal Fellow
AI Safety · AI Control · Scientific Machine Learning
Marat Subkhankulov
Independent
Stefan Heimersheim
Apollo Research