📝 Abstract
RMSNorm is used by many LLMs such as Llama, Mistral, and OpenELM. This paper details FlashNorm, an exact yet faster implementation of RMSNorm followed by linear layers. FlashNorm also speeds up Layer Normalization (LayerNorm) and its recently proposed replacement, Dynamic Tanh (DyT) (arXiv:2503.10622). See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
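To make the "exact but faster" claim concrete, here is a minimal NumPy sketch of one way such a fusion can work: because RMSNorm's per-channel gain `g` is a fixed diagonal scaling, it can be folded into the weight matrix of the following linear layer ahead of time, and the `1/rms` factor applied once to the output. The function names, the gain `g`, and the weight matrix `W` below are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def rmsnorm_then_linear(x, g, W, eps=1e-6):
    # Reference path: RMSNorm with gain g, then a linear layer W.
    rms = np.sqrt(np.mean(x * x) + eps)
    return ((x / rms) * g) @ W

def fused_norm_linear(x, W_folded, eps=1e-6):
    # Fused path: the gain is pre-folded into the weights
    # (W_folded = diag(g) @ W), so at runtime only one matmul
    # and one scalar division remain. Mathematically identical:
    # ((x / rms) * g) @ W == (x @ diag(g) @ W) / rms.
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x @ W_folded) / rms
```

Because the rewrite is pure algebra (moving a diagonal matrix across a matmul), the fused path is bit-for-bit equivalent up to floating-point rounding, which matches the abstract's "exact" characterization.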