AI Summary
To address the memory-wall and data-movement bottlenecks induced by layer normalization in large Transformer models, this paper proposes a hardware-oriented, fast iterative L2 normalization method. The approach combines Newton-Raphson iteration with adaptive scaling, converging to a high-precision L2 norm within five iterations for both FP32 and BFloat16 inputs. In accuracy, it outperforms the fast inverse square root algorithm in six of nine FP32 cases and five of nine BFloat16 cases across the embedding lengths used in the OPT models. A custom macro is implemented in 32/28 nm CMOS, supporting vector dimensions from 64 to 1024 with a latency of 116-227 cycles at 100 MHz and 1.05 V. Placing normalization on the same chip as the matrix-multiplication engine significantly reduces off-chip memory-access overhead, and the design integrates into the multi-head attention and feed-forward network backends, enabling on-chip normalization for large-scale inference.
Abstract
Transformer-based large language models are memory-bound: their operation relies on large amounts of data that are only marginally reused. Thus, data movement between the host and the accelerator is likely to dictate the total wall-clock time. Layer normalization is one of the key workloads in the Transformer model, following each multi-head attention and feed-forward network block. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28 nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where $64 \leq d \leq 1024$, with a latency of 116-227 cycles at 100 MHz and 1.05 V.
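The abstract does not spell out the iteration itself, but the standard Newton-Raphson update for the inverse square root, combined with a power-of-four pre-scaling step (one plausible reading of "adaptive scaling"), can be sketched as below. This is a floating-point illustration only; the paper's fixed-point formulation, exact scaling rule, and initial guess are not reproduced here, and the function name `iter_l2norm` is our own.

```python
import math

def iter_l2norm(v, steps=5):
    """Sketch of iterative L2 normalization (not the paper's exact design).

    Computes s = sum(v_i^2), then refines y ~ 1/sqrt(s) with the
    Newton-Raphson update  y <- y * (1.5 - 0.5 * s * y^2).
    Pre-scaling s by powers of four keeps the scaled value in
    [0.25, 1), so a fixed initial guess converges in a few steps.
    """
    s = sum(x * x for x in v)
    if s == 0.0:
        return list(v)  # zero vector: nothing to normalize

    # Adaptive scaling: s = s_scaled * 4^k with s_scaled in [0.25, 1),
    # so 1/sqrt(s) = (1/sqrt(s_scaled)) * 2^(-k).
    k = 0
    s_scaled = s
    while s_scaled >= 1.0:
        s_scaled *= 0.25
        k += 1
    while s_scaled < 0.25:
        s_scaled *= 4.0
        k -= 1

    y = 1.0  # initial guess; adequate for s_scaled in [0.25, 1)
    for _ in range(steps):
        y = y * (1.5 - 0.5 * s_scaled * y * y)

    inv_norm = y * (2.0 ** (-k))
    return [x * inv_norm for x in v]
```

Quadratic convergence of Newton-Raphson is what makes a small fixed iteration count (five here, matching the abstract's claim) sufficient once the argument is confined to a narrow interval by the scaling step.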