Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key bottlenecks in fully quantized training (FQT) of large language models (LLMs): poor training consistency and an exponential search space in which each candidate configuration requires >200B tokens to validate. We propose pseudo-quantization training (PQT) built around a floating-point-friendly Gaussian weight-sampling noise distribution, which enables FP6 weight representations with up to 9-bit high-precision noise and supports stochastic precision annealing. Fake quantization reduces to an addition followed by an FP cast, compatible with standard BF16 operators, which improves training stability and efficiency. Pretraining GPT2 and Llama2 models with up to 1B parameters over 300B tokens, PQT matches or exceeds the BF16 baseline while requiring just 2 bytes of GPU memory per parameter and incurring computational overhead as low as 1.40% over the BF16 baseline on A100 GPUs.

📝 Abstract
The ever-growing scale of large language models (LLMs) is pushing for improved efficiency, favoring fully quantized training (FQT) over BF16. While FQT accelerates training, it faces consistency challenges and requires searching over an exponential number of cases, each needing over 200B tokens to ensure stability. Pseudo-quantization training (PQT) addresses the issues of FQT, although it is not well-studied. We explore the practical implications of PQT in detail and propose a noise distribution $R$ that is floating-point (FP)-friendly, with ideal properties including stochastic precision annealing. As a result, the proposed method serves as an effective theoretical foundation for low-precision FP parameters through PQT, utilizing efficient fake quantization via an addition and subsequent FP casting. We demonstrate that Gaussian weight sampling is (1) scalable: it supports low-precision FP parameters down to FP6 and high-precision noise up to 9-bit with BF16 operators; (2) efficient: it incurs computational overhead as low as 1.40% on the A100 GPU in terms of Llama2 training tokens per second, and requires 2 bytes per parameter in GPU memory; and (3) stable: it closely follows or even surpasses the performance of the BF16 baseline while pre-training GPT2 and Llama2 models with up to 1B parameters and 300B tokens.
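The abstract describes fake quantization as "an addition and subsequent FP casting": sample Gaussian noise, add it to the weights, and cast the result to a low-precision FP format. A minimal sketch of that idea is below — the function names, the fixed `noise_scale`, and the mantissa-truncation stand-in for FP6 casting are all illustrative assumptions, not the paper's implementation or noise schedule:

```python
import numpy as np

def cast_to_low_precision_fp(x, mantissa_bits=2):
    """Simulate casting to a low-precision FP format (e.g. an FP6
    variant with 2 mantissa bits) by rounding the mantissa.
    Hypothetical helper; the paper targets hardware FP/BF16 operators."""
    m, e = np.frexp(x)              # mantissa in [0.5, 1), integer exponent
    scale = 2.0 ** mantissa_bits
    m = np.round(m * scale) / scale  # keep only `mantissa_bits` of precision
    return np.ldexp(m, e)

def pseudo_quantize(weights, noise_scale=0.02, rng=None):
    """Additive fake quantization: add Gaussian noise, then cast.
    `noise_scale` is a free parameter here, not the paper's schedule."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_scale, size=weights.shape)
    return cast_to_low_precision_fp(weights + noise)

w = np.array([0.731, -0.402, 0.055, 1.250])
w_q = pseudo_quantize(w, rng=np.random.default_rng(0))
```

Because the forward pass sees noisy, coarsely cast weights while the optimizer updates the high-precision copy, the noise acts like a stochastic rounding signal — which is the property the paper's precision annealing builds on.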
Problem

Research questions and friction points this paper is trying to address.

Consistency challenges in fully quantized training (FQT) of large language models (LLMs)
Exponential FQT search space, with each configuration requiring over 200B tokens to validate
Need for scalable, efficient, and stable low-precision pre-training at the 1B-parameter scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP-friendly Gaussian noise distribution for PQT with stochastic precision annealing
Gaussian weight sampling supports FP6 parameters with up to 9-bit noise on BF16 operators
Fake quantization via a single addition and FP cast: low compute overhead (1.40%) and 2 bytes/parameter