IterL2Norm: Fast Iterative L2-Normalization

📅 2024-12-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the memory-wall and data-movement bottlenecks that layer normalization induces in large Transformer models, this paper proposes a hardware-oriented, fast-converging L2-normalization method. The approach combines fixed-point Newton-Raphson iteration with adaptive scaling, achieving high-precision L2-norm computation within five iterations for FP32 and BFloat16 inputs. In accuracy, it outperforms the Fast Inverse Square Root algorithm in six of nine FP32 cases and five of nine BFloat16 cases across the embedding lengths used in the OPT models. A custom macro-cell implemented in 32/28 nm CMOS supports vector dimensions from 64 to 1024, with a latency of 116–227 cycles at 100 MHz and 1.05 V. Hardware measurements confirm IEEE FP32 compliance, and performing normalization on-chip significantly reduces off-chip memory access overhead. The design integrates into the multi-head attention and feed-forward network backends, enabling real-time normalization for large-scale inference.

πŸ“ Abstract
Transformer-based large language models are a memory-bound model whose operation is based on a large amount of data that are marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where $64 leq d leq 1024$, with a latency of 116-227 cycles at 100MHz/1.05V.
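The core idea sketched in the abstract can be illustrated in software: compute the sum of squares, obtain its inverse square root by Newton-Raphson iteration, and scale the vector. The sketch below is a minimal floating-point illustration, not the paper's fixed-point datapath; the power-of-two pre-scaling used here is an assumption standing in for the paper's adaptive scaling, chosen so that a fixed initial guess converges within five steps.

```python
import numpy as np

def iter_l2_normalize(x, num_iters=5):
    """Normalize a 1-D vector to unit L2 norm via Newton-Raphson
    iterations for the inverse square root (illustrative sketch;
    the paper's fixed-point hardware details are not reproduced)."""
    s = np.dot(x, x)                    # sum of squares
    if s == 0.0:
        return x                        # zero vector stays zero
    # Assumed adaptive scaling: pull s toward 1 with a power of four
    # so the iteration starts near its fixed point.
    e = np.frexp(s)[1]                  # s = m * 2**e with 0.5 <= m < 1
    shift = e // 2
    s_scaled = s / (4.0 ** shift)       # roughly within [0.5, 2)
    y = 1.0                             # initial guess for 1/sqrt(s_scaled)
    for _ in range(num_iters):
        y = 0.5 * y * (3.0 - s_scaled * y * y)  # Newton-Raphson step
    inv_norm = y / (2.0 ** shift)       # undo the pre-scaling
    return x * inv_norm
```

With five iterations the result matches `x / np.linalg.norm(x)` to well within FP32 precision for the scaled range above, which is the convergence behavior the paper reports for its fixed-point variant.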
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Layer Normalization
Efficiency Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

IterL2Norm
L2 Norm Normalization
Large Language Models
Changmin Ye
Division of Materials Science and Engineering, Hanyang University, Seoul, Republic of Korea
Yonguk Sim
Department of Semiconductor Engineering, Hanyang University, Seoul, Republic of Korea
Youngchae Kim
Department of Semiconductor Engineering, Hanyang University, Seoul, Republic of Korea
S. Jin
Division of Materials Science and Engineering, Hanyang University, Seoul, Republic of Korea
Doo Seok Jeong
Hanyang University
Neuromorphic hardware design · Spiking neural network theory · Learning algorithms · Deep learning acceleration · Nonvolatile memory