Measuring Maximum Activations in Open Large Language Models

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses the lack of systematic measurements of activation dynamic range in current open-weight large language models, a gap that hinders low-bit quantization and stable deployment. The authors build a unified evaluation framework that measures global and per-layer maximum activations across embeddings, hidden states, attention blocks, and MLP/MoE modules on 27 checkpoints spanning eight model families. Using a diverse 5,000-sample corpus, family-specific tokenizers, a consistent hooking mechanism, and an INT-8 reconstruction-error check, they demonstrate for the first time that activation peaks are governed primarily by model family, architecture, and training stage rather than by parameter count alone. MoE models exhibit peaks 14.0-23.4x lower than dense counterparts of comparable size, and the residual stream carries the global maximum in 22 of 24 checkpoints. Global maxima span nearly four orders of magnitude at similar scales (e.g., Gemma3-27B-it reaches ~7×10⁵), which directly affects quantization efficacy; the authors therefore recommend reporting this metric alongside model releases.
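To make the link between peak magnitude and quantization error concrete, here is a minimal sketch in plain Python. It assumes symmetric per-tensor INT8 quantization with an absmax scale, which may differ from the scheme the authors actually use. The function names and toy values are hypothetical; only the 7e5 outlier magnitude is taken from the figure quoted above.

```python
def quantize_absmax_int8(values):
    """Symmetric per-tensor INT8 quantization with an absmax scale (assumed scheme)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def reconstruction_mse(values):
    """Mean squared error between the values and their dequantized INT8 version."""
    q, scale = quantize_absmax_int8(values)
    recon = [qi * scale for qi in q]
    return sum((v - r) ** 2 for v, r in zip(values, recon)) / len(values)


# Hypothetical toy activations: 64 "typical" values in [-1.5, 1.5].
typical = [0.5 * ((i % 7) - 3) for i in range(64)]
# The same values plus one massive activation of the magnitude quoted above.
with_outlier = typical + [7e5]

mse_typical = reconstruction_mse(typical)
mse_outlier = reconstruction_mse(with_outlier)
print(f"MSE without outlier: {mse_typical:.3e}")
print(f"MSE with outlier:    {mse_outlier:.3e}")
```

In this toy case the single outlier sets the scale, so every typical value rounds to the zero level and its information is lost. That is the mechanism by which a large maximum degrades a per-tensor scheme.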
📝 Abstract
The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
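The hook-based measurement described in the abstract can be sketched with PyTorch forward hooks. This is an illustrative sketch, not the authors' pipeline: the helper `attach_max_hooks` and the toy `nn.Sequential` model are hypothetical stand-ins, and a real run would hook the named submodules of an actual checkpoint (embeddings, attention, MLP/MoE, final norm) while iterating over a tokenized corpus.

```python
import torch
import torch.nn as nn


def attach_max_hooks(model):
    """Register forward hooks that track the running max |activation| per leaf module."""
    maxima, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Some modules return tuples; this sketch inspects the first element only.
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and out.numel() > 0:
                peak = out.detach().abs().max().item()
                maxima[name] = max(maxima.get(name, 0.0), peak)
        return hook

    for name, module in model.named_modules():
        # Hook leaf modules only, and skip the root module (empty name).
        if name and len(list(module.children())) == 0:
            handles.append(module.register_forward_hook(make_hook(name)))
    return maxima, handles


torch.manual_seed(0)
# Hypothetical stand-in for a transformer block: Linear -> SiLU -> Linear.
toy = nn.Sequential(nn.Linear(8, 16), nn.SiLU(), nn.Linear(16, 8))
maxima, handles = attach_max_hooks(toy)

with torch.no_grad():
    for _ in range(4):  # stands in for iterating over corpus batches
        toy(torch.randn(2, 8))

for h in handles:
    h.remove()  # always detach hooks once measurement is done

global_site = max(maxima, key=maxima.get)  # module carrying the global maximum
global_max = maxima[global_site]
print(maxima, global_site, global_max)
```

Keeping per-module running maxima and taking the arg-max at the end yields both the layerwise profile and the site of the global maximum in one pass. Measuring the residual stream itself would need hooks on block outputs rather than on leaf modules.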
Problem

Research questions and friction points this paper is trying to address.

maximum activations
large language models
low-bit quantization
model deployment
activation scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

maximum activation
low-bit quantization
open LLMs
MoE architecture
activation scaling