Normalized Architectures are Natively 4-Bit

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

career value

173K/year

🤖 AI Summary

Training large language models at 4-bit precision often suffers from numerical instability, necessitating additional interventions. This work proposes the nGPT architecture, which inherently enhances robustness to low-precision arithmetic by constraining both weights and hidden representations onto the unit hypersphere. The approach reveals, for the first time, the intrinsic advantage of normalized structures in ultra-low-bit training. By strengthening signal correlations, improving signal-to-noise ratio, and flattening the loss landscape, nGPT achieves more stable optimization—benefits that become increasingly pronounced with larger model scales. Using NVFP4 quantization, nGPT enables stable end-to-end training on a 1.2B dense model and on Mamba-Transformer mixture-of-experts architectures up to 30B parameters, without requiring stochastic Hadamard transforms or tensor-wise scaling.

📝 Abstract

Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions-such as applying random Hadamard transforms and performing per-tensor scaling calculations-to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt-nvfp4

Problem

Research questions and friction points this paper is trying to address.

4-bit training

low-precision arithmetic

large language models

quantization robustness

model stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

normalized architecture

4-bit training

nGPT