PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training

📅 2025-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the estimation bias, high variance, and optimizer-state misalignment that low-rank gradient compression introduces in LLM training, this paper proposes PLUMAGE, an unbiased, minimum-variance low-rank gradient estimator. Methodologically: (1) it uses probabilistic low-rank projection to obtain an unbiased gradient estimate with minimum variance; (2) it realigns the optimizer's first- and second-moment states whenever the projection is updated, without introducing hyperparameters beyond the rank r and the update interval. Experiments show that PLUMAGE shrinks the gap to full-rank optimization by 33% on average in pre-training evaluation loss and by 28% in average training loss across the GLUE benchmark, with a computational and memory footprint similar to GaLoRE. The method is a drop-in replacement for existing low-rank gradient estimators.

📝 Abstract

Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models (LLMs) with billions of parameters. Existing low-rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank-r subspace, enabling LLM training on consumer hardware. Yet these methods are either biased or subject to high estimator variance. Moreover, the optimizer state, which holds first- and second-moment estimates expressed in the previous subspace, becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop-in replacement for existing low-rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve optimizer state misalignment issues to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the gap to full-rank optimization by 33% on average across models in pre-training evaluation loss, and by 28% in average training loss across the GLUE benchmark, within a computational and memory footprint similar to GaLoRE.
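To make "unbiased" and "minimum variance" concrete, here is a minimal sketch and not the paper's algorithm. It assumes the singular values of a gradient matrix are already available, keeps each singular component independently with probability p_i, and rescales kept components by 1/p_i so the estimate equals the full gradient in expectation. Under this independent-sampling assumption the expected squared Frobenius error is the sum of sigma_i^2 (1/p_i - 1), which is minimized for an expected rank r by p_i = min(1, c * sigma_i). The function names are hypothetical, and the number of kept components is random here (r only on average), unlike a fixed-rank projection.

```python
import random


def inclusion_probs(sigmas, r):
    """Return p_i = min(1, c * sigma_i) with sum(p) == r.

    Assumes every sigma_i > 0 and 0 < r <= len(sigmas).
    """
    n = len(sigmas)
    order = sorted(range(n), key=lambda i: -sigmas[i])  # largest first
    capped = 0  # number of leading components pinned at p_i = 1
    c = 0.0
    while capped < n:
        total = sum(sigmas[i] for i in order[capped:])
        c = (r - capped) / total
        if c * sigmas[order[capped]] <= 1.0:
            break  # no remaining probability exceeds 1
        capped += 1
    p = [0.0] * n
    for rank, i in enumerate(order):
        p[i] = 1.0 if rank < capped else min(1.0, c * sigmas[i])
    return p


def sample_coefficients(sigmas, p, rng):
    """Keep component i with probability p_i and rescale it by 1 / p_i."""
    return [s / pi if rng.random() < pi else 0.0 for s, pi in zip(sigmas, p)]


def expected_sq_error(sigmas, p):
    """Expected squared Frobenius error of the rescaled estimate."""
    return sum(s * s * (1.0 / pi - 1.0) for s, pi in zip(sigmas, p))


sigmas = [10.0, 3.0, 2.0, 1.0]  # illustrative singular values
p = inclusion_probs(sigmas, r=2)
uniform = [2 / len(sigmas)] * len(sigmas)
print(p, expected_sq_error(sigmas, p), expected_sq_error(sigmas, uniform))
```

In this toy case the dominant component is always kept, and the sigma-proportional probabilities give a much smaller expected error than keeping every component with the same probability. A deterministic top-r truncation would have zero variance but would systematically drop the tail, which is the bias an unbiased estimator avoids.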

Problem

Research questions and friction points this paper is trying to address.

Reduces memory and networking bottlenecks in large model training

Addresses bias and high variance in low-rank gradient estimators

Resolves optimizer state misalignment to enhance training stability
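The misalignment problem can be illustrated with a small sketch. It assumes the first-moment state is stored as coordinates in the current projection basis and re-expresses it through a change of basis when the projection is updated. This is one common way to realign a linear state and is not claimed to be the paper's procedure; the helper names are hypothetical, and the second moment, which is not linear in the gradient, is deliberately left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def realign_first_moment(m_old, p_old, p_new):
    """Re-express a first moment stored in span(p_old) in the basis p_new.

    p_old, p_new: d x r matrices with orthonormal columns (assumed).
    m_old: r x n first-moment state in the old subspace coordinates.
    """
    change_of_basis = matmul(transpose(p_new), p_old)  # r x r
    return matmul(change_of_basis, m_old)


# Toy example with d = 3, r = 1, n = 2.
p_old = [[1.0], [0.0], [0.0]]
p_new = [[0.6], [0.8], [0.0]]
m_old = [[2.0, -1.0]]

m_new = realign_first_moment(m_old, p_old, p_new)
naive_update = matmul(p_new, m_old)    # reuses old coordinates in the new basis
aligned_update = matmul(p_new, m_new)  # projects the old momentum onto the new basis
print(m_new, naive_update, aligned_update)
```

Reusing the old coordinates unchanged keeps the full momentum magnitude but points it along the new direction, including a component along the second axis where no momentum was ever accumulated. That is the kind of spurious update the abstract refers to; the realigned state keeps only the part of the old momentum that lies in the new subspace.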

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unbiased low-rank gradient estimator for LLMs

Resolves optimizer state misalignment issues

Maintains computational efficiency with minimal hyperparameters

🔎 Similar Papers

Multiple importance sampling for stochastic gradient estimation