Randomized Gradient Subspaces for Efficient Large Language Model Training

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the GPU memory bottleneck caused by optimizer states in large language model (LLM) training, this work identifies that gradient energy concentrates dynamically in a low-dimensional subspace, with a non-negligible share remaining in the residual bulk, and that the local loss landscape exhibits near-flat curvature. Leveraging these geometric insights, we propose GrassManifold, the first geometry-aware optimization framework based on random subspace projection, and introduce two novel algorithms: GrassWalk and GrassJump. These methods preserve full gradient directionality while drastically compressing optimizer-state memory. Experiments on LLaMA-1B and LLaMA-7B pretraining demonstrate state-of-the-art memory reduction of up to 68% without sacrificing convergence speed or final model quality; notably, LAMBADA accuracy improves by +0.8%. Our approach establishes a new paradigm for memory-efficient, large-scale LLM training.

📝 Abstract
Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent works mitigate this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit gradient subspaces and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
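To make the memory argument concrete, here is a minimal sketch of the general random-subspace idea the abstract describes: project the gradient of an m×n weight matrix onto a random r-dimensional subspace, keep the Adam moment buffers in that r×n space instead of the full m×n space, and map the update back. This is a generic illustration of random subspace projection, not the paper's actual GrassWalk or GrassJump update rules; the function name and details are assumptions for illustration.

```python
import numpy as np

def random_subspace_adam_step(G, state, lr=1e-3, rank=4,
                              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-like step with optimizer state kept in a random
    low-dimensional subspace (a hypothetical sketch, not the
    paper's exact GrassWalk/GrassJump algorithm)."""
    m, n = G.shape
    if "P" not in state:
        # Random orthonormal basis spanning an r-dimensional subspace.
        rng = np.random.default_rng(0)
        Q, _ = np.linalg.qr(rng.standard_normal((m, rank)))
        state["P"] = Q                      # m x r projection basis
        state["M"] = np.zeros((rank, n))    # first moment, r x n
        state["V"] = np.zeros((rank, n))    # second moment, r x n
        state["t"] = 0
    P = state["P"]
    R = P.T @ G                             # projected gradient: r x n
    state["t"] += 1
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R**2
    m_hat = state["M"] / (1 - beta1 ** state["t"])
    v_hat = state["V"] / (1 - beta2 ** state["t"])
    # Map the low-rank update back to the full parameter space.
    return -lr * P @ (m_hat / (np.sqrt(v_hat) + eps))
```

The memory saving comes from the moment buffers: they occupy 2·r·n floats instead of 2·m·n, so for r much smaller than m the optimizer state shrinks by roughly m/r. Resampling the basis periodically (rather than keeping it fixed) is one way such methods can track the drifting gradient subspace noted in the abstract.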
Problem

Research questions and friction points this paper is trying to address.

Reduces memory bottlenecks in large language model training
Addresses optimizer state dominance in memory footprint
Improves training efficiency through randomized subspace algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomized algorithms exploit gradient subspaces efficiently
GrassWalk and GrassJump achieve state-of-the-art memory savings
Methods improve performance in LLaMA pretraining scenarios
👥 Authors
Sahar Rajabi, Researcher (Machine Learning, Optimization, Interpretability, NLP, Multimodal Learning)
Nayeema Nonta, Department of Management Science and Engineering, University of Waterloo
Samanvay Vajpayee, Department of Computer Science, University of Toronto
Sirisha Rambhatla, Assistant Professor at the University of Waterloo (Machine Learning, Statistical Signal Processing, Optimization, AI for Healthcare)