Flash normalization: fast normalization for LLMs

📅 2024-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational overhead that normalization layers add to large language model inference, covering RMSNorm, LayerNorm, and the recently proposed Dynamic Tanh (DyT). The authors propose FlashNorm, an exact (zero-error) but faster reformulation of RMSNorm followed by linear layers: the normalization weights are merged into the subsequent linear layers ahead of time, so the normalization itself reduces to a cheap per-token scaling with no loss of precision. The same fusion idea also speeds up Layer Normalization and DyT. The technique applies to mainstream Transformer architectures that use RMSNorm, including Llama, Mistral, and OpenELM, and the implementation is open-sourced in the transformer-tricks repository.

📝 Abstract
RMSNorm is used by many LLMs such as Llama, Mistral, and OpenELM. This paper details FlashNorm, which is an exact but faster implementation of RMSNorm followed by linear layers. FlashNorm also speeds up Layer Normalization and its recently proposed replacement Dynamic Tanh (DyT) arXiv:2503.10622. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
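The abstract describes FlashNorm as an exact fusion of RMSNorm with the linear layers that follow it. A minimal NumPy sketch of that idea (function and variable names are illustrative, not from the paper's code): fold the RMSNorm gain vector `g` into the rows of the weight matrix `W` once, offline, and apply the per-token 1/RMS scalar after the matmul instead of before it.

```python
import numpy as np

def rmsnorm_then_linear(x, g, W, eps=1e-6):
    # Baseline: RMSNorm with gain g, then a linear layer W.
    rms = np.sqrt(np.mean(x**2) + eps)
    return (x / rms * g) @ W

def flashnorm_linear(x, W_fused, eps=1e-6):
    # Fused: g has been folded into W offline; the 1/rms factor
    # is a scalar per token, so it can be applied after the matmul.
    rms = np.sqrt(np.mean(x**2) + eps)
    return (x @ W_fused) / rms

d, m = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
g = rng.standard_normal(d)
W = rng.standard_normal((d, m))

W_fused = g[:, None] * W  # one-time offline merge of the gain into W

# Both paths produce the same output, up to floating-point tolerance.
assert np.allclose(rmsnorm_then_linear(x, g, W), flashnorm_linear(x, W_fused))
```

The fusion is exact because both the gain and the RMS factor act as elementwise/scalar multiplications that commute with the matrix product; at inference time the normalization costs one reduction and one scalar multiply per token.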
Problem

Research questions and friction points this paper is trying to address.

RMSNorm adds runtime overhead in LLMs such as Llama and Mistral
Layer Normalization incurs a similar per-token normalization cost
Dynamic Tanh (DyT), a recently proposed LayerNorm replacement, leaves similar room for fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlashNorm: faster exact RMSNorm implementation
Speeds up Layer Normalization and Dynamic Tanh
Fuses normalization weights into the subsequent linear layers
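The same weight-folding trick extends to Dynamic Tanh, since DyT's gain and bias are also elementwise. A hedged NumPy sketch (the paper's exact DyT treatment may differ; names here are illustrative), using DyT(x) = γ ⊙ tanh(αx) + β from arXiv:2503.10622:

```python
import numpy as np

def dyt_then_linear(x, alpha, gamma, beta, W):
    # Baseline: Dynamic Tanh (DyT), then a linear layer W.
    return (gamma * np.tanh(alpha * x) + beta) @ W

def dyt_fused(x, alpha, W_fused, bias):
    # Fused: gamma folded into W, and beta @ W precomputed as a bias.
    return np.tanh(alpha * x) @ W_fused + bias

d, m = 8, 16
rng = np.random.default_rng(1)
x = rng.standard_normal(d)
alpha = 0.7
gamma = rng.standard_normal(d)
beta = rng.standard_normal(d)
W = rng.standard_normal((d, m))

W_fused = gamma[:, None] * W  # one-time offline merge
bias = beta @ W               # constant, precomputed once

assert np.allclose(dyt_then_linear(x, alpha, gamma, beta, W),
                   dyt_fused(x, alpha, W_fused, bias))
```

At inference the fused path needs only the tanh, the matmul, and a bias add, removing the separate elementwise gain and bias steps.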
🔎 Similar Papers
2024-06-27 · Conference on Empirical Methods in Natural Language Processing · Citations: 2
Nils Graef
OpenMachine, San Francisco Bay Area
Matthew Clapp
OpenMachine, San Francisco Bay Area
Andrew Wasielewski
OpenMachine, San Francisco Bay Area