🤖 AI Summary
A fundamental tension exists between low-precision computation and lossless compression in large language model (LLM) deployment. Method: The paper identifies an exponent concentration phenomenon in generative AI model weights, showing that their exponents are inherently low-entropy across architectures and modalities; leveraging α-stable distribution theory, it establishes a theoretical floating-point compression limit near FP4.67. Building on this, the authors propose ECF8 (Exponent-Concentrated FP8), a fully lossless floating-point compression framework that couples entropy-aware encoding with GPU-optimized decoding and requires no dequantization. Results: Evaluated on a 671B-parameter LLM and on DiT models, ECF8 achieves up to 26.9% memory reduction and 177.1% throughput improvement while guaranteeing no deviation in model outputs. The work bridges theory and practice by unifying the proven compression bound with a deployable format, establishing a new paradigm for lossless low-precision computing in the FP8 era.
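To make the exponent concentration claim concrete, here is a minimal sketch (not from the paper) that estimates the Shannon entropy of the FP8 E4M3 exponent field of a weight tensor. The helper `exponent_entropy_e4m3`, the handling of subnormals, and the synthetic heavy-tailed weights are all illustrative assumptions; the decomposition of the FP4.67 limit as sign + exponent entropy + mantissa is our reading for E4M3, not a claim taken from the summary, and the toy distribution will not reproduce the paper's exact figure.

```python
# Illustrative sketch (not from the paper): estimate the Shannon entropy of
# the FP8 E4M3 exponent field of a weight tensor. Low entropy here is the
# "exponent concentration" the paper describes. Subnormals and specials are
# ignored, and the synthetic weights are a stand-in, so the printed figure
# will not match the paper's FP4.67 bound.
import numpy as np

def exponent_entropy_e4m3(weights: np.ndarray) -> float:
    """Shannon entropy (bits) of the biased E4M3 exponent field."""
    w = weights[weights != 0].astype(np.float64)
    # Biased exponent = floor(log2|w|) + 7, clamped to E4M3's 4-bit range.
    exps = np.clip(np.floor(np.log2(np.abs(w))).astype(int) + 7, 0, 15)
    p = np.bincount(exps, minlength=16) / exps.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = 0.02 * rng.standard_t(df=4, size=1_000_000)  # heavy-tailed stand-in
h = exponent_entropy_e4m3(w)
# Assumption: if the FP4.67 limit decomposes as sign + H(exponent) +
# mantissa for E4M3, the effective format width is 1 + H + 3 bits.
print(f"H(exponent) = {h:.2f} bits -> effective FP{1 + h + 3:.2f}")
```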
📝 Abstract
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
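As a rough illustration of what entropy-aware, fully lossless encoding can look like in this setting, the sketch below (not the authors' ECF8 implementation) splits each FP8 E4M3 byte into its fields, entropy-codes only the low-entropy exponent stream, keeps sign and mantissa raw, and verifies bit-exact reconstruction. zlib stands in for a real entropy coder with GPU-friendly decoding, and the peaked synthetic exponent histogram is an assumption.

```python
# Illustrative sketch (not the authors' ECF8 implementation): field-split,
# fully lossless compression of FP8 E4M3 tensors. Only the low-entropy
# exponent stream is entropy-coded (zlib stands in for a real entropy
# coder); sign and mantissa bits stay raw and bit-exact.
import zlib
import numpy as np

def compress_e4m3(raw: np.ndarray) -> tuple[bytes, bytes]:
    """Split E4M3 bytes (S EEEE MMM) and entropy-code the exponent field."""
    exp = (raw >> 3) & np.uint8(0x0F)   # 4-bit biased exponent
    rest = raw & np.uint8(0x87)         # sign bit + 3 mantissa bits
    # A deployed format would pack two 4-bit sign/mantissa groups per byte;
    # they are kept unpacked here for clarity.
    return zlib.compress(exp.tobytes(), 9), rest.tobytes()

def decompress_e4m3(exp_blob: bytes, rest_blob: bytes) -> np.ndarray:
    exp = np.frombuffer(zlib.decompress(exp_blob), dtype=np.uint8)
    rest = np.frombuffer(rest_blob, dtype=np.uint8)
    return rest | (exp << 3)            # bit-exact reconstruction

# Synthetic FP8 bytes with a peaked (concentrated) exponent histogram.
rng = np.random.default_rng(0)
n = 1_000_000
exp = rng.binomial(15, 0.5, n).astype(np.uint8)
sgn = rng.integers(0, 2, n, dtype=np.uint8) << 7
man = rng.integers(0, 8, n, dtype=np.uint8)
raw = sgn | (exp << 3) | man

exp_blob, rest_blob = compress_e4m3(raw)
assert np.array_equal(raw, decompress_e4m3(exp_blob, rest_blob))
print(f"exponent stream: {n} B -> {len(exp_blob)} B, round trip lossless")
```

The design choice mirrors the abstract's premise: because only the exponent field carries redundancy, decoding touches a small compressed stream while sign and mantissa pass through untouched, which is what makes a no-dequantization, output-identical pipeline plausible.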