🤖 AI Summary
To address the GPU memory bottleneck caused by optimizer states in large language model (LLM) training, this work identifies that gradient energy dynamically concentrates in a low-dimensional subspace, with a significant residual outside it, and that the local loss landscape exhibits near-flat curvature. Leveraging these geometric insights, the authors propose GrassManifold, presented as the first geometry-aware optimization framework based on random subspace projection, and introduce two algorithms: GrassWalk and GrassJump. These methods preserve full gradient directionality while sharply compressing optimizer-state memory. Experiments on LLaMA-1B and LLaMA-7B pretraining demonstrate state-of-the-art memory reduction (up to 68%) without sacrificing convergence speed or final model quality; notably, LAMBADA accuracy improves by +0.8%. The approach points toward a new paradigm for memory-efficient, large-scale LLM training.
📝 Abstract
Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most of the gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
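To make the memory argument concrete, the sketch below shows the generic mechanism the abstract builds on: gradients are projected onto a low-rank basis so that Adam-style moment states live in `r` dimensions instead of the full `d`. The random orthonormal basis and the plain Adam update here are illustrative stand-ins, not the paper's GrassWalk/GrassJump rules (which adapt the subspace over time); all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 64  # full gradient dimension, subspace rank (r << d)

# Hypothetical random orthonormal basis for the subspace; the paper's
# algorithms choose/update the basis differently.
P, _ = np.linalg.qr(rng.standard_normal((d, r)))  # P: (d, r), P.T @ P = I_r

# Optimizer moments are stored only in the r-dimensional subspace,
# so state memory is 2*r floats instead of full Adam's 2*d.
m = np.zeros(r)  # first moment
v = np.zeros(r)  # second moment
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def step(w, grad, t):
    """One Adam-style update with subspace-compressed optimizer state."""
    global m, v
    g = P.T @ grad                       # project gradient: d -> r
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # lift back: r -> d
    return w - lr * update

w = rng.standard_normal(d)
w = step(w, rng.standard_normal(d), t=1)
```

With these (made-up) sizes, the moment states shrink to r/d = 6.25% of full Adam's, which illustrates how subspace projection yields the memory savings the abstract describes.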