Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the severe degradation caused by naive W4A4 post-training quantization in a 300M-parameter SwiGLU language model, where validation perplexity rises from an FP16 baseline of 23.6 to 1727. The authors introduce a “reader/generator” decomposition of the five linears in each block and find that, once the residual-axis readers (qkv, w1, w3) are controlled, the remaining error is dominated by the block-internal generators, chiefly w2. Depth Registers with a register-magnitude hinge loss (DR+sink), a training-time intervention on the residual axis, reduce W4A4 perplexity to 119 at matched FP16 perplexity and zero-shot capacity, and to 39.9 when composed with SmoothQuant. Per-Linear QuaRot, a post-hoc alternative, nearly matches DR+sink on the reader axis, while full QuaRot with online Hadamard rotations does not close the remaining gap. The authors present DR+sink as a diagnostic probe rather than a deployment method, and limit their claims to a single-seed, 300M-parameter, 5B-token setting.
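To illustrate why naive round-to-nearest activation quantization is fragile, the following minimal sketch applies symmetric 4-bit fake quantization to synthetic activations, with and without a single high-magnitude channel. The grid of integers in [-7, 7], the per-token scale, the toy tensor sizes, and the 50x outlier are all assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

QMAX = 7  # symmetric signed 4-bit grid: integers in [-7, 7] (an assumption)


def fake_quant_int4(x):
    """Per-row symmetric round-to-nearest: snap to the 4-bit grid, then dequantize."""
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / QMAX
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale


def rel_error(x, x_hat):
    """Relative Frobenius-norm reconstruction error."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))


def excess_kurtosis(x):
    """Excess kurtosis pooled over all entries (0 for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
x_clean = rng.standard_normal((64, 256))  # 64 tokens, 256 channels (toy sizes)
x_outlier = x_clean.copy()
x_outlier[:, 0] *= 50.0                   # one synthetic outlier channel

# Measure error on the ordinary channels only, so the outlier's own
# magnitude does not mask the damage done to everything else.
err_clean = rel_error(x_clean[:, 1:], fake_quant_int4(x_clean)[:, 1:])
err_outlier = rel_error(x_outlier[:, 1:], fake_quant_int4(x_outlier)[:, 1:])

print(f"excess kurtosis: clean={excess_kurtosis(x_clean):.2f}, "
      f"outlier={excess_kurtosis(x_outlier):.2f}")
print(f"rel. error on ordinary channels: clean={err_clean:.3f}, "
      f"outlier={err_outlier:.3f}")
```

Because the per-token scale is set by the largest entry, a single heavy-tailed channel stretches the grid and pushes most ordinary channels toward zero. This is the kind of effect that kurtosis is used to track in the abstract below.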

📝 Abstract

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention, Depth Registers with a register-magnitude hinge loss (DR+sink), reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The remaining gap to FP16 (39.9 vs. 23.6 PPL) is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot, which adds online per-head value Hadamard plus online w2-input rotation, does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
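The reader/generator split and the "product of factor bounds" remark can be made concrete with a small numerical sketch. It assumes a toy SwiGLU MLP with random Gaussian weights and inputs (hypothetical sizes, not the paper's model) and checks one elementary bound on the w2 input: since |silu(a)| <= |a|, Cauchy-Schwarz gives |z_i| <= ||w1_i|| ||w3_i|| ||x||^2. This is an illustrative bound of that type, not necessarily the exact argument used by the authors.

```python
import numpy as np

# Partition stated in the abstract: readers consume the residual stream,
# generators consume activations produced inside the block.
READERS = ("qkv", "w1", "w3")
GENERATORS = ("o_proj", "w2")


def silu(a):
    return a / (1.0 + np.exp(-a))


def w2_input(x, W1, W3):
    """Bilinear SwiGLU activation fed to w2: silu(x W1^T) * (x W3^T)."""
    return silu(x @ W1.T) * (x @ W3.T)


def w2_input_bound(x, W1, W3):
    """Per-entry bound ||w1_i|| * ||w3_i|| * ||x_t||^2."""
    row_norms = np.linalg.norm(W1, axis=1) * np.linalg.norm(W3, axis=1)  # (d_ff,)
    tok_norms_sq = np.sum(x ** 2, axis=1, keepdims=True)                # (n_tok, 1)
    return tok_norms_sq * row_norms


def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
d_model, d_ff, n_tok = 64, 256, 512  # toy sizes (assumptions)
x = rng.standard_normal((n_tok, d_model))
W1 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W3 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)

z = w2_input(x, W1, W3)
bound = w2_input_bound(x, W1, W3)

print("bound holds everywhere:", bool(np.all(np.abs(z) <= bound + 1e-9)))
print(f"excess kurtosis: reader input x = {excess_kurtosis(x):.2f}, "
      f"w2 input z = {excess_kurtosis(z):.2f}")
```

Under these assumptions, doubling the residual-stream input doubles the reader pre-activations but quadruples the bound on the w2 input, and the product of two near-Gaussian factors is heavy-tailed even when its input is not. That is one way to read the claim that controlling residual-axis magnitude constrains readers far more tightly than it constrains w2.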

Problem

Research questions and friction points this paper is trying to address.

W4A4 quantization

SwiGLU

activation quantization

perplexity degradation

bilinear activation

Innovation

Methods, ideas, or system contributions that make the work stand out.

W4A4 quantization

SwiGLU decomposition

Depth Registers