🤖 AI Summary
To address the GPU memory bottleneck caused by gradient and optimizer state during large language model (LLM) pretraining, this paper proposes GaLore 2, an efficient and scalable gradient low-rank projection framework. Unlike the original GaLore, which relies on frequent singular value decomposition (SVD) to compute projection subspaces, GaLore 2 introduces an SVD-free dynamic subspace update mechanism. It integrates with state-of-the-art parallelization strategies, including Fully Sharded Data Parallel (FSDP), and natively supports low-bit quantization and higher-order tensor decomposition. At a 500B-token pretraining scale, GaLore 2 reduces peak GPU memory consumption by up to 42% compared to baseline methods, improves training throughput, and enables pretraining Llama-7B from scratch. The resulting model matches the performance of its full-rank counterpart, demonstrating both scalability and fidelity.
📝 Abstract
Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore (Gradient Low-Rank Projection) addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works extend GaLore in several directions, including low-bit quantization and higher-order tensor structures. However, several challenges remain, such as the computational overhead of SVD-based subspace updates and integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. We further demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch on up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
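The gradient low-rank projection idea at the heart of GaLore can be sketched in a few lines. The following is a minimal NumPy illustration under simplifying assumptions: a single weight matrix, plain SGD instead of Adam, and hypothetical function names (`update_subspace`, `galore_step`) that are ours, not the paper's API. The memory saving comes from keeping optimizer state on the small projected gradient rather than the full one; the SVD-based subspace refresh shown here is the step GaLore 2 replaces with an SVD-free update.

```python
import numpy as np

def update_subspace(grad, rank):
    """Periodic SVD-based subspace refresh (original GaLore style)."""
    # Leading left singular vectors of the gradient span its dominant subspace.
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]  # projection matrix P, shape (m, rank)

def galore_step(W, grad, P, lr=1e-2):
    """One plain-SGD step with the gradient projected into rank-r space."""
    R = P.T @ grad  # low-rank gradient, shape (rank, n)
    # In the actual method, Adam moments are stored on R (rank x n) rather
    # than on the full (m x n) gradient, which yields the memory saving.
    return W - lr * (P @ R)  # project the update back to full space
```

In the original GaLore, the SVD in `update_subspace` is re-run only every few hundred steps to amortize its cost; even so, it is this overhead that motivates GaLore 2's SVD-free dynamic subspace update.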