AI Summary
To address the memory-wall and data-movement bottlenecks induced by layer normalization in large Transformer models, this paper proposes a hardware-oriented, fast iterative L2 normalization method. The approach combines Newton-Raphson iteration with adaptive scaling, converging to a high-precision L2 norm within five iterations for both FP32 and BFloat16 inputs. In accuracy, it outperforms the fast inverse square root algorithm in six of nine FP32 cases and five of nine BFloat16 cases across the embedding lengths used in the OPT models. A custom macro is implemented in 32/28 nm CMOS, supporting vector dimensions from 64 to 1024 with a latency of 116-227 cycles at 100 MHz and 1.05 V. Placing normalization on the same chip as the matrix-multiplication engine significantly reduces off-chip memory-access overhead, and the design integrates into the multi-head attention and feed-forward network backends, enabling on-chip normalization for large-scale inference.
Abstract
Transformer-based large language models are memory-bound: their operation relies on large amounts of data that are only marginally reused. Thus, data movement between the host and the accelerator is likely to dictate the total wall-clock time. Layer normalization is one of the key workloads in the Transformer model, following each multi-head attention and feed-forward network block. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28 nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where $64 \leq d \leq 1024$, with a latency of 116-227 cycles at 100 MHz and 1.05 V.
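The abstract does not spell out the iteration itself, but the standard Newton-Raphson update for the inverse square root, combined with a power-of-four pre-scaling step (one plausible reading of "adaptive scaling"), can be sketched as below. This is a floating-point illustration only; the paper's fixed-point formulation, exact scaling rule, and initial guess are not reproduced here, and the function name `iter_l2norm` is our own.

```python
import math

def iter_l2norm(v, steps=5):
    """Sketch of iterative L2 normalization (not the paper's exact design).

    Computes s = sum(v_i^2), then refines y ~ 1/sqrt(s) with the
    Newton-Raphson update  y <- y * (1.5 - 0.5 * s * y^2).
    Pre-scaling s by powers of four keeps the scaled value in
    [0.25, 1), so a fixed initial guess converges in a few steps.
    """
    s = sum(x * x for x in v)
    if s == 0.0:
        return list(v)  # zero vector: nothing to normalize

    # Adaptive scaling: s = s_scaled * 4^k with s_scaled in [0.25, 1),
    # so 1/sqrt(s) = (1/sqrt(s_scaled)) * 2^(-k).
    k = 0
    s_scaled = s
    while s_scaled >= 1.0:
        s_scaled *= 0.25
        k += 1
    while s_scaled < 0.25:
        s_scaled *= 4.0
        k -= 1

    y = 1.0  # initial guess; adequate for s_scaled in [0.25, 1)
    for _ in range(steps):
        y = y * (1.5 - 0.5 * s_scaled * y * y)

    inv_norm = y * (2.0 ** (-k))
    return [x * inv_norm for x in v]
```

Quadratic convergence of Newton-Raphson is what makes a small fixed iteration count (five here, matching the abstract's claim) sufficient once the argument is confined to a narrow interval by the scaling step.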