🤖 AI Summary
A fundamental tension exists between low-precision computation and lossless compression in large language model (LLM) deployment. Method: The paper identifies an exponent concentration phenomenon in generative AI model weights, showing that their exponents are inherently low-entropy across architectures and modalities; leveraging α-stable distribution theory, it establishes a theoretical floating-point compression limit near FP4.67. Building on this, the authors propose ECF8 (Exponent-Concentrated FP8), a fully lossless floating-point compression framework that couples entropy-aware encoding with GPU-optimized decoding and requires no dequantization. Results: Evaluated on a 671B-parameter LLM and on DiT models, ECF8 achieves up to 26.9% memory reduction and 177.1% throughput improvement while guaranteeing no deviation in model outputs. The work bridges theory and practice by unifying the proven compression bound with a deployable format, establishing a new paradigm for lossless low-precision computing in the FP8 era.
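To make the exponent concentration claim concrete, here is a minimal sketch (not from the paper) that estimates the Shannon entropy of the FP8 E4M3 exponent field of a weight tensor. The helper `exponent_entropy_e4m3`, the handling of subnormals, and the synthetic heavy-tailed weights are all illustrative assumptions; the decomposition of the FP4.67 limit as sign + exponent entropy + mantissa is our reading for E4M3, not a claim taken from the summary, and the toy distribution will not reproduce the paper's exact figure.

```python
# Illustrative sketch (not from the paper): estimate the Shannon entropy of
# the FP8 E4M3 exponent field of a weight tensor. Low entropy here is the
# "exponent concentration" the paper describes. Subnormals and specials are
# ignored, and the synthetic weights are a stand-in, so the printed figure
# will not match the paper's FP4.67 bound.
import numpy as np

def exponent_entropy_e4m3(weights: np.ndarray) -> float:
    """Shannon entropy (bits) of the biased E4M3 exponent field."""
    w = weights[weights != 0].astype(np.float64)
    # Biased exponent = floor(log2|w|) + 7, clamped to E4M3's 4-bit range.
    exps = np.clip(np.floor(np.log2(np.abs(w))).astype(int) + 7, 0, 15)
    p = np.bincount(exps, minlength=16) / exps.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = 0.02 * rng.standard_t(df=4, size=1_000_000)  # heavy-tailed stand-in
h = exponent_entropy_e4m3(w)
# Assumption: if the FP4.67 limit decomposes as sign + H(exponent) +
# mantissa for E4M3, the effective format width is 1 + H + 3 bits.
print(f"H(exponent) = {h:.2f} bits -> effective FP{1 + h + 3:.2f}")
```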
📝 Abstract
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
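As a rough illustration of what entropy-aware, fully lossless encoding can look like in this setting, the sketch below (not the authors' ECF8 implementation) splits each FP8 E4M3 byte into its fields, entropy-codes only the low-entropy exponent stream, keeps sign and mantissa raw, and verifies bit-exact reconstruction. zlib stands in for a real entropy coder with GPU-friendly decoding, and the peaked synthetic exponent histogram is an assumption.

```python
# Illustrative sketch (not the authors' ECF8 implementation): field-split,
# fully lossless compression of FP8 E4M3 tensors. Only the low-entropy
# exponent stream is entropy-coded (zlib stands in for a real entropy
# coder); sign and mantissa bits stay raw and bit-exact.
import zlib
import numpy as np

def compress_e4m3(raw: np.ndarray) -> tuple[bytes, bytes]:
    """Split E4M3 bytes (S EEEE MMM) and entropy-code the exponent field."""
    exp = (raw >> 3) & np.uint8(0x0F)   # 4-bit biased exponent
    rest = raw & np.uint8(0x87)         # sign bit + 3 mantissa bits
    # A deployed format would pack two 4-bit sign/mantissa groups per byte;
    # they are kept unpacked here for clarity.
    return zlib.compress(exp.tobytes(), 9), rest.tobytes()

def decompress_e4m3(exp_blob: bytes, rest_blob: bytes) -> np.ndarray:
    exp = np.frombuffer(zlib.decompress(exp_blob), dtype=np.uint8)
    rest = np.frombuffer(rest_blob, dtype=np.uint8)
    return rest | (exp << 3)            # bit-exact reconstruction

# Synthetic FP8 bytes with a peaked (concentrated) exponent histogram.
rng = np.random.default_rng(0)
n = 1_000_000
exp = rng.binomial(15, 0.5, n).astype(np.uint8)
sgn = rng.integers(0, 2, n, dtype=np.uint8) << 7
man = rng.integers(0, 8, n, dtype=np.uint8)
raw = sgn | (exp << 3) | man

exp_blob, rest_blob = compress_e4m3(raw)
assert np.array_equal(raw, decompress_e4m3(exp_blob, rest_blob))
print(f"exponent stream: {n} B -> {len(exp_blob)} B, round trip lossless")
```

The design choice mirrors the abstract's premise: because only the exponent field carries redundancy, decoding touches a small compressed stream while sign and mantissa pass through untouched, which is what makes a no-dequantization, output-identical pipeline plausible.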