Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two problems with Pre-LayerNorm in large language models: redundant statistical computation, and uncontrolled growth of activation magnitude and variance in deep layers. It also targets the training instability of existing normalization-free approaches. The authors propose BHyT, which, for the first time, introduces a bounded hyperbolic tangent (tanh) as a plug-and-play replacement for Pre-LayerNorm. BHyT combines data-driven input clipping, single-pass exact statistics, and a lightweight variance approximation to theoretically guarantee training stability while improving computational efficiency. Experiments show that BHyT matches or exceeds the performance and robustness of RMSNorm across multiple language-understanding and reasoning benchmarks, achieving an average 15.8% speedup in pretraining and a 4.2% increase in token-generation throughput.

📝 Abstract
Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth: as layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces the second normalization with a lightweight variance approximation. Empirically, BHyT improves stability and efficiency during pretraining, training on average 15.8% faster and generating tokens at 4.2% higher throughput than RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
Problem

Research questions and friction points this paper is trying to address.

Pre-Layer Normalization
training stability
computational efficiency
depth scaling
activation magnitude
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bounded Hyperbolic Tanh
Pre-Layer Normalization
activation stability
efficient training
large language models