Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two problems with Pre-LayerNorm in large language models: redundant statistical computation, and uncontrolled growth of activation magnitude and variance in deep layers. It also targets the training instability of existing normalization-free approaches. The authors propose BHyT, which, for the first time, introduces a bounded hyperbolic tangent (tanh) as a plug-and-play replacement for Pre-LayerNorm. BHyT combines data-driven input clipping, single-pass exact statistics, and a lightweight variance approximation to theoretically guarantee training stability while improving computational efficiency. Experiments show that BHyT matches or exceeds the performance and robustness of RMSNorm across multiple language-understanding and reasoning benchmarks, achieving an average 15.8% speedup in pretraining and a 4.2% increase in token-generation throughput.

📝 Abstract
Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth: as layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces the second normalization with a lightweight variance approximation. Empirically, BHyT improves stability and efficiency during pretraining, training on average 15.8% faster and generating tokens at 4.2% higher throughput than RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
Problem

Research questions and friction points this paper is trying to address.

Pre-Layer Normalization
training stability
computational efficiency
depth scaling
activation magnitude
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bounded Hyperbolic Tanh
Pre-Layer Normalization
activation stability
efficient training
large language models