🤖 AI Summary
This work addresses the challenge of post-training model compression at extremely low bitrates (<4 bits), where existing data-free methods often suffer from catastrophic performance degradation, while calibration-based approaches incur high computational costs and are sensitive to distribution shifts. The authors propose EntQuant, a novel framework that uniquely integrates entropy coding with floating-point quantization (e.g., Float8@2bit), decoupling numerical precision from storage cost to enable efficient, calibration-free compression. EntQuant achieves both the universality of data-free methods and the high fidelity typically associated with data-dependent techniques. It compresses 70B-parameter models within 30 minutes while attaining state-of-the-art accuracy and preserving strong functional capabilities on complex instruction-tuned models, all with manageable inference overhead.
📝 Abstract
Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer functional collapse at extreme bitrates below 4 bits. On the other hand, techniques that leverage calibration data or extensive recovery training achieve superior fidelity but impose high computational costs and face uncertain robustness under data distribution shifts. We introduce EntQuant, the first framework to unite the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant makes the extreme compression regime practically usable. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B-parameter model in under 30 minutes. We demonstrate that EntQuant not only achieves state-of-the-art results on standard evaluation sets and models but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.
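The decoupling of numerical precision from storage cost can be illustrated with a toy sketch (this is not the paper's actual method, codebook, or FP8 format, all of which are assumed here for illustration): quantize Gaussian-like weights onto a 16-level nonuniform grid, then entropy-code the resulting level indices with a Huffman code. Because real weight distributions are sharply peaked around zero, the index distribution is highly skewed, and the average storage per weight falls well below the nominal 4 bits even though each weight is still represented at full grid precision.

```python
import heapq
import math
import random
from collections import Counter

def huffman_code_lengths(freqs):
    """Build a Huffman tree over symbol frequencies; return {symbol: code length}."""
    # Each heap entry: (total frequency, unique tiebreaker, {symbol: depth so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical 16-level nonuniform grid (sign * mantissa * 2^exponent),
# denser near zero, loosely in the spirit of a low-bit float format.
levels = sorted({s * m * 2.0 ** e
                 for s in (-1, 1) for m in (1.0, 1.5) for e in range(-3, 1)})

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(20_000)]
# Nearest-neighbor quantization: store only the index into `levels`.
idx = [min(range(len(levels)), key=lambda i: abs(levels[i] - w)) for w in weights]

freqs = Counter(idx)
lengths = huffman_code_lengths(freqs)
n = len(idx)
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / n
entropy = -sum((f / n) * math.log2(f / n) for f in freqs.values())
print(f"nominal: {math.log2(len(levels)):.1f} bits/weight, "
      f"entropy: {entropy:.2f}, huffman: {avg_bits:.2f}")
```

The grid still has 16 levels (a nominal 4 bits/weight), but the entropy-coded stream averages far fewer bits because most weights land on the few smallest-magnitude levels; this is the sense in which storage cost is decoupled from the precision of the representation.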