🤖 AI Summary
This study identifies a “layer collapse” phenomenon in diffusion language models: a few early layers show highly similar activation patterns dominated by a single super-outlier, a behavior markedly distinct from autoregressive models. Analyzing LLaDA-8B through activation profiling, controlled pretraining experiments, GPTQ quantization, and sparsity allocation strategies, the work attributes this collapse to overtraining rather than undertraining and finds that redundancy concentrates in early layers, the reverse of autoregressive counterparts. The practical consequence concerns model compression: under 3-bit GPTQ quantization, LLaDA loses only 1.8% on GSM8K, whereas Llama-3.1-8B loses 64.7%, and at 50% average sparsity, allocating more sparsity to early layers in LLaDA yields an 8.4% gain over the reverse strategy.
📝 Abstract
Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers, the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only 1.8% on GSM8K, whereas Llama-3.1-8B drops 64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama 8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.
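To make the sparsity-allocation comparison concrete, the sketch below builds two per-layer sparsity schedules that both average to 50%: one that prunes early layers more heavily and its reverse. The abstract does not say how the paper distributes sparsity across layers, so the linear ramp, the function name `linear_sparsity_schedule`, the `delta` parameter, and the 32-layer count are all illustrative assumptions, not the authors' method.

```python
def linear_sparsity_schedule(num_layers, avg_sparsity=0.5, delta=0.2):
    """Return per-layer sparsity ratios whose mean is `avg_sparsity`.

    Hypothetical helper: a linear ramp is assumed here for illustration.
    delta > 0 assigns more sparsity to early layers,
    delta < 0 assigns more sparsity to late layers.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [avg_sparsity]

    schedule = []
    for i in range(num_layers):
        # position runs linearly from +1 (first layer) to -1 (last layer),
        # so the offsets cancel and the mean stays at avg_sparsity
        position = 1.0 - 2.0 * i / (num_layers - 1)
        schedule.append(avg_sparsity + delta * position)

    if min(schedule) < 0.0 or max(schedule) > 1.0:
        raise ValueError("delta pushes a layer's sparsity outside [0, 1]")
    return schedule


# 32 layers is a placeholder value, not a figure taken from the abstract.
early_heavy = linear_sparsity_schedule(32, avg_sparsity=0.5, delta=0.2)
late_heavy = linear_sparsity_schedule(32, avg_sparsity=0.5, delta=-0.2)

print(f"early-heavy: first={early_heavy[0]:.2f}, last={early_heavy[-1]:.2f}")
print(f"late-heavy:  first={late_heavy[0]:.2f}, last={late_heavy[-1]:.2f}")
```

Each ratio would then be passed to whatever per-layer pruning routine is in use. According to the abstract, the early-heavy direction is the one that helps LLaDA, while the same allocation hurts Llama.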