🤖 AI Summary
To address the prohibitively high memory overhead of stateful optimizers (e.g., Adam) in large language model training, whose auxiliary state can reach twice the parameter count, this work proposes the first stable 2-bit ultra-low-precision optimizer. The method introduces (1) a logarithmic quantization scheme robust to signal swamping in unsigned states and (2) a precision-specific momentum value, derived from a gradient-variance analysis, that preserves convergence stability under extreme quantization of signed states. When training a 7B-parameter model, the optimizer reduces optimizer-state memory by approximately 45 GB while incurring negligible accuracy degradation. This advance substantially improves the feasibility of large-model training under severe memory constraints, enabling efficient optimization at unprecedentedly low bit widths.
📝 Abstract
The explosion in model sizes leads to continually growing, prohibitive training and fine-tuning costs, particularly for stateful optimizers, which maintain auxiliary information of up to 2x the model size to achieve optimal convergence. In this work, we therefore present a novel type of optimizer that carries extremely lightweight state overhead, achieved through ultra-low-precision quantization. While previous efforts have achieved some success with 8-bit and 4-bit quantization, our approach enables optimizers to operate at precision as low as 3 bits, or even 2 bits, per state element. This is accomplished by identifying and addressing two critical challenges: the signal swamping problem in unsigned quantization, which results in unchanged state dynamics, and the rapidly increasing gradient variance in signed quantization, which leads to incorrect descent directions. Our theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum value for the latter. Consequently, the proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss. We hope that SOLO can help overcome the computational-resource bottleneck and thereby promote greater accessibility in fundamental research.
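To make the signal-swamping problem concrete, the following is a minimal sketch, not SOLO's actual codebook: it quantizes an unsigned optimizer state (e.g., Adam's second moment) to 2 bits with linearly spaced versus logarithmically spaced levels. The specific level values and the per-tensor scaling are illustrative assumptions; real implementations typically quantize block-wise.

```python
import numpy as np

def quantize(state, levels):
    """Map each value to the index of the nearest codebook level."""
    scale = state.max()  # per-tensor scale; block-wise in practice (assumption)
    normed = state / scale
    idx = np.abs(normed[:, None] - levels[None, :]).argmin(axis=1)
    return idx, scale

def dequantize(idx, scale, levels):
    return levels[idx] * scale

# 2 bits -> 4 representable levels in [0, 1].
linear_levels = np.array([0.0, 1/3, 2/3, 1.0])
# Logarithmic spacing concentrates resolution near zero, where
# second-moment values cluster; these values are illustrative.
log_levels = np.array([0.0, 1e-4, 1e-2, 1.0])

state = np.array([1.0, 3e-3, 5e-5, 2e-4])

lin = dequantize(*quantize(state, linear_levels), linear_levels)
log = dequantize(*quantize(state, log_levels), log_levels)
# Under linear levels, every small entry collapses to 0, so those state
# elements never change ("signal swamping"); log levels keep most of
# them nonzero and distinguishable.
```

In this toy example, all three small entries round to zero under linear levels, while the logarithmic codebook preserves nonzero values for two of them, which is the qualitative behavior the tailored logarithmic quantization is designed to exploit.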