EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
In language model fine-tuning, the stochasticity induced by small-batch training causes severe fluctuations in generation quality. Taking a standard exponential moving average (EMA) of the weights smooths training, but the accumulated weight of stale iterates introduces an optimization lag. To address this, we propose Bias-Corrected EMA (BEMA), a theoretically grounded variant that eliminates this lag while preserving EMA's variance reduction via an explicit bias-correction mechanism. We establish a convergence analysis framework for BEMA and prove a faster convergence rate than both standard EMA and vanilla SGD. Empirically, BEMA consistently improves training stability, accelerates convergence, and achieves higher final performance across multiple mainstream language model fine-tuning benchmarks.

📝 Abstract
Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model in which we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on language models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training on a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
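
The abstract describes the mechanism only at a high level, so the sketch below illustrates the general idea with the most common form of EMA bias correction: the Adam-style rescaling by 1/(1 − β^t), which removes the bias a zero-initialized average has toward its starting point. This is a stand-in under that assumption, not necessarily the paper's exact BEMA correction, and the class name and interface are hypothetical.

```python
import torch

class BiasCorrectedEMA:
    """Shadow EMA of model parameters with Adam-style debiasing.

    Illustrative stand-in for the paper's BEMA: a plain EMA is kept in
    ``shadow`` and the bias from its zero initialization is removed by
    dividing by (1 - beta**t).
    """

    def __init__(self, model, beta=0.999):
        self.beta = beta
        self.t = 0
        # Zero-initialized shadow weights; the division by (1 - beta**t)
        # in debiased() corrects for this initialization.
        self.shadow = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        """Fold the current weights into the running average."""
        self.t += 1
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.beta).add_(p, alpha=1 - self.beta)

    def debiased(self):
        """Return bias-corrected averaged weights for evaluation."""
        assert self.t > 0, "call update() at least once first"
        scale = 1.0 / (1.0 - self.beta ** self.t)
        return {n: s * scale for n, s in self.shadow.items()}

# Hypothetical usage in a fine-tuning loop:
#   ema = BiasCorrectedEMA(model, beta=0.999)
#   ... after each optimizer step: ema.update(model)
#   ... at eval time: load ema.debiased() into a copy of the model.
```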
Problem

Research questions and friction points this paper is trying to address.

Mitigates training instability from small batch sizes
Reduces the lag that bias toward old iterates introduces in exponential moving averages
Improves convergence rates in language model fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bias-Corrected Exponential Moving Average (BEMA)
Eliminates bias in EMA while reducing variance
Accelerates convergence in language model fine-tuning (see the toy demonstration below)
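
To make the lag concrete, here is a small synthetic run (all values made up for illustration): noisy iterates drift toward a target, the plain EMA trails behind them, and the same Adam-style debiasing as in the sketch above, again a stand-in for the paper's BEMA correction, closes much of the gap.

```python
import numpy as np

# Synthetic demonstration of EMA lag. Noisy "SGD-like" iterates x drift
# toward a target; the EMA averages in stale early values and trails x,
# while dividing by (1 - beta**t) (Adam-style debiasing) shrinks the gap.
rng = np.random.default_rng(0)
target, beta, steps = 1.0, 0.99, 500
x, ema = 0.0, 0.0
for t in range(1, steps + 1):
    x += 0.05 * (target - x) + 0.02 * rng.standard_normal()
    ema = beta * ema + (1 - beta) * x
    debiased = ema / (1 - beta ** t)
    if t % 100 == 0:
        print(f"t={t:4d}  iterate={x:+.3f}  ema={ema:+.3f}  debiased={debiased:+.3f}")
```

In this run the debiased average tracks the iterate far more closely than the raw EMA over the first few hundred steps, which is exactly the lag the paper sets out to remove.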