A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

📅 2025-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Training large language models (LLMs) faces two memory bottlenecks, activations and optimizer states, which are especially severe under long-context and large-batch regimes. This paper proposes a randomized subspace optimization framework for LLM pre-training and fine-tuning. In each iteration, a random low-dimensional subspace is sampled and the parameters are optimized within it, which reduces the memory footprint of both activations and optimizer states. The framework accommodates different optimization strategies for solving the subproblems, and the paper establishes convergence guarantees and derives rates for several scenarios. Experiments show performance comparable to GaLore and Adam, with lower memory consumption and communication overhead.

📝 Abstract

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
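The loop described in the abstract (sample a random subspace, optimize within it, repeat) can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: the least-squares objective, the Gaussian sampling of the projection `P`, the plain gradient-descent inner solver, and all names and hyperparameters (`subspace_train`, `rank`, `inner_steps`, `lr`) are assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grad(W, X, Y):
    """Least-squares loss 0.5/s * ||X W - Y||_F^2 and its gradient with respect to W."""
    R = X @ W - Y
    s = X.shape[0]
    return 0.5 * np.sum(R * R) / s, X.T @ R / s

def subspace_train(X, Y, rank=4, outer_iters=50, inner_steps=10, lr=0.1, seed=0):
    """Toy randomized subspace optimization (illustrative assumptions throughout)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape[1], Y.shape[1]
    W = np.zeros((m, n))
    history = []
    for _ in range(outer_iters):
        # Sample a random subspace; the columns of P span it (assumed Gaussian).
        P = rng.standard_normal((m, rank)) / np.sqrt(m)
        # Low-dimensional variable: only rank * n entries are trained.
        B = np.zeros((rank, n))
        for _ in range(inner_steps):
            _, G = loss_and_grad(W + P @ B, X, Y)
            # Chain rule: gradient with respect to B is P^T times gradient with respect to W.
            B -= lr * (P.T @ G)
        # Fold the subproblem solution back into the full parameters.
        W = W + P @ B
        history.append(loss_and_grad(W, X, Y)[0])
    return W, history

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((256, 32))
    Y = X @ rng.standard_normal((32, 8))
    _, history = subspace_train(X, Y)
    print("first and last loss:", history[0], history[-1])
```

Any optimizer state in this sketch would be attached to `B` (shape `rank × n`) rather than to `W` (shape `m × n`), which is where the optimizer-state saving comes from. The inner solver could be swapped for another optimizer without changing the outer loop.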

Problem

Research questions and friction points this paper is trying to address.

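Training LLMs with Adam is memory-intensive, and existing techniques such as GaLore reduce only the memory footprint of optimizer states.

Activation memory is left unaddressed, which makes those techniques unsuitable for long context sequences or large mini-batches.

The convergence properties of existing memory-efficient methods are not well understood.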
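To make the two memory terms concrete, the sketch below counts floats for a single linear layer and checks the identity behind the activation saving. It assumes the layer computes `Y = X (W + P B)` with `W` frozen and only `B` trained; under that assumption the gradient with respect to `B` needs only the projected input `X P`, not `X`. The layer sizes and function names are hypothetical, and activations stored elsewhere in a real network (for example around nonlinearities) are not counted.

```python
import numpy as np

def memory_counts(tokens, m, n, r):
    """Float counts for one m-by-n linear layer (illustrative accounting only)."""
    return {
        # Adam keeps two moment estimates per trainable parameter.
        "adam_states_full": 2 * m * n,
        "adam_states_subspace": 2 * r * n,
        # Input saved for the backward pass: X (tokens x m) versus X P (tokens x r).
        "saved_input_full": tokens * m,
        "saved_input_subspace": tokens * r,
    }

def grad_B_full(X, P, dY):
    """Gradient with respect to B via the full weight gradient X^T dY."""
    return P.T @ (X.T @ dY)

def grad_B_projected(XP, dY):
    """Same gradient computed from the projected input X P alone."""
    return XP.T @ dY

if __name__ == "__main__":
    counts = memory_counts(tokens=8192, m=4096, n=4096, r=256)
    for name, value in counts.items():
        print(f"{name}: {value:,} floats")
```

With these hypothetical sizes, both the optimizer states and the saved layer input shrink by a factor of `m / r`.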

Innovation

Methods, ideas, or system contributions that make the work stand out.

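Randomized Subspace Optimization reduces memory usage for both activations and optimizer states.

Decomposes the high-dimensional training problem into a series of lower-dimensional subproblems, each restricted to a randomly selected subspace.

Provides convergence guarantees and rates that accommodate different strategies for solving the subproblems.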

🔎 Similar Papers

Large Vocabulary Size Improves Large Language Models