🤖 AI Summary
This study investigates how weight pruning, weight quantization, and activation quantization affect the pretraining scaling laws of large language models (LLMs), aiming to establish a unified effective-parameter scaling framework. It is the first to incorporate quantization into such a framework, theoretically modeling and empirically validating parameter efficiency across varying bit-widths and sparsity levels. Results show that weight-only quantization substantially improves parameter efficiency, while full quantization of both weights and activations exhibits diminishing returns at ultra-low bit-widths. Through systematic ablation and cross-configuration modeling, the authors derive a unified scaling formula that predicts performance under diverse compression strategies. Key contributions are: (1) demonstrating that disparate compression techniques share a common effective-parameter scaling mechanism; (2) unifying quantization and pruning within a single theoretical framework; and (3) providing a composable, predictive foundation and optimization paradigm for efficient LLM design.
📝 Abstract
We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter-efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bit-widths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.
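The "effective parameter" idea can be sketched numerically. Below is a minimal illustration, not the paper's actual fit: it plugs a compression-dependent multiplier `m` into a Chinchilla-style loss curve, so that a compressed model with `N` parameters is predicted to behave like a dense model with `m * N` parameters. The constants are the Hoffmann et al. (2022) fits and the function name `predicted_loss` is hypothetical, used purely for illustration.

```python
def predicted_loss(N, D, m=1.0, E=1.69, A=406.4, B=410.7,
                   alpha=0.34, beta=0.28):
    """Chinchilla-style pretraining loss with an effective parameter count.

    L(N, D) = E + A / (m * N)**alpha + B / D**beta

    N: raw parameter count, D: training tokens,
    m: parameter-efficiency multiplier of the compression scheme
       (m = 1 for a dense, full-precision model).
    Constants are illustrative Chinchilla fits, not the paper's values.
    """
    return E + A / (m * N) ** alpha + B / D ** beta

# A compressed model with multiplier m behaves exactly like a dense
# model with m * N parameters under this functional form:
dense_half = predicted_loss(0.5e9, 2e10)          # dense, 0.5B params
compressed = predicted_loss(1e9, 2e10, m=0.5)     # 1B params, m = 0.5
print(dense_half == compressed)
```

Under this framing, comparing two compression schemes reduces to comparing their fitted multipliers, which is what makes the framework composable across quantization and sparsity.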