ECO: Quantized Training without Full-Precision Master Weights

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial memory overhead of maintaining high-precision master weight buffers during quantized training of large language models, a bottleneck that is especially acute in sparse mixture-of-experts (SMoE) architectures. The authors propose the Error-Compensating Optimizer (ECO), which enables stable quantized training without maintaining master weights: after each step the weights are re-quantized, and the resulting quantization error is injected into the optimizer's momentum, forming an error-feedback loop at no additional memory cost. Theoretical analysis shows that, under standard assumptions, ECO converges to a constant-radius neighborhood of the optimum, whereas naive master-weight removal can incur an error inversely proportional to the learning rate. Empirically, under FP8 and INT4 quantization, ECO achieves near-lossless accuracy across pretraining and fine-tuning on models ranging from 30M to 16B parameters, matching master-weight baselines while significantly reducing the static memory footprint and advancing the Pareto frontier between memory efficiency and validation loss.

📝 Abstract
Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
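The error-feedback loop described in the abstract can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: the function names (`quantize`, `eco_step`), the uniform-grid quantizer standing in for FP8/INT4, and the exact scaling used when injecting the error into the momentum are all assumptions made here for clarity.

```python
import numpy as np

def quantize(w, scale=0.1):
    # Uniform grid quantizer; a simple stand-in for FP8/INT4 formats.
    return np.round(w / scale) * scale

def eco_step(w_q, m, grad, lr=0.1, beta=0.9, scale=0.1):
    """One error-feedback update without master weights (sketch).

    w_q is the only stored copy of the parameters (already quantized);
    m is the momentum buffer, which also absorbs the quantization error,
    so no extra buffer is allocated.
    """
    m = beta * m + grad                 # standard momentum accumulation
    w_target = w_q - lr * m             # transient high-precision update
    w_new = quantize(w_target, scale)   # store weights back in low precision
    err = w_new - w_target              # error introduced by re-quantization
    m = m + err / lr                    # feed the error into the momentum so
                                        # the next step compensates for it
    return w_new, m
```

For example, stepping from `w_q = 0.9` with `m = 1.0` and `grad = 0.9` targets `0.72`, which quantizes down to `0.7`; the lost `0.02` is folded back into the momentum instead of being discarded, so the shortfall is recovered on subsequent steps.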
Problem

Research questions and friction points this paper is trying to address.

quantization
master weights
memory overhead
Large Language Model
Sparse Mixture of Experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

quantized training
master weights elimination
error feedback
memory-efficient LLM training
Sparse MoE