🤖 AI Summary
To address the GPU memory bottleneck caused by optimizer states in large language model (LLM) training, this work identifies that gradient energy dynamically concentrates in a low-dimensional subspace, with a significant residual outside it, and that the local loss landscape exhibits near-flat curvature. Leveraging these geometric insights, the authors propose GrassManifold, presented as the first geometry-aware optimization framework based on random subspace projection, and introduce two algorithms: GrassWalk and GrassJump. These methods preserve full gradient directionality while sharply compressing optimizer-state memory. Experiments on LLaMA-1B and LLaMA-7B pretraining demonstrate state-of-the-art memory reduction (up to 68%) without sacrificing convergence speed or final model quality; notably, LAMBADA accuracy improves by +0.8%. The approach points toward a new paradigm for memory-efficient, large-scale LLM training.
📝 Abstract
Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most of the gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
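To make the memory argument concrete, the sketch below shows the generic mechanism the abstract builds on: gradients are projected onto a low-rank basis so that Adam-style moment states live in `r` dimensions instead of the full `d`. The random orthonormal basis and the plain Adam update here are illustrative stand-ins, not the paper's GrassWalk/GrassJump rules (which adapt the subspace over time); all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 64  # full gradient dimension, subspace rank (r << d)

# Hypothetical random orthonormal basis for the subspace; the paper's
# algorithms choose/update the basis differently.
P, _ = np.linalg.qr(rng.standard_normal((d, r)))  # P: (d, r), P.T @ P = I_r

# Optimizer moments are stored only in the r-dimensional subspace,
# so state memory is 2*r floats instead of full Adam's 2*d.
m = np.zeros(r)  # first moment
v = np.zeros(r)  # second moment
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def step(w, grad, t):
    """One Adam-style update with subspace-compressed optimizer state."""
    global m, v
    g = P.T @ grad                       # project gradient: d -> r
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # lift back: r -> d
    return w - lr * update

w = rng.standard_normal(d)
w = step(w, rng.standard_normal(d), t=1)
```

With these (made-up) sizes, the moment states shrink to r/d = 6.25% of full Adam's, which illustrates how subspace projection yields the memory savings the abstract describes.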