🤖 AI Summary
To address the high memory overhead and slow convergence of optimizers for large language models (LLMs), this paper proposes a systematic design paradigm based on Frobenius-norm approximation of the structured Fisher Information Matrix (FIM). It unifies mainstream efficient optimizers as structured approximations of the FIM — first establishing principled criteria for selecting structural assumptions and a low-rank extension framework, from which two novel optimizers, RACS and Alice, are derived. RACS achieves state-of-the-art performance with memory overhead close to that of SGD; Alice converges more than 2× faster than Adam when pretraining LLaMA-family models up to 1B parameters. The method integrates structured matrix approximation, FIM-based geometric modeling, gradient covariance estimation, and adaptive scaling — balancing theoretical rigor with engineering practicality.
📝 Abstract
Designing efficient optimizers for large language models (LLMs) with low memory requirements and fast convergence is an important and challenging problem. This paper takes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical, efficient LLM optimizers: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate their effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam, with little memory overhead. Notably, Alice achieves more than 2× faster convergence than Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
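The abstract does not spell out the RACS update rule, but the name "Row and Column Scaled SGD" with SGD-like memory suggests an update in which each weight matrix's gradient is rescaled by per-row and per-column second-moment statistics, so only two vectors (rather than a full Adam-style second-moment matrix) are stored per layer. The sketch below illustrates that general idea with an Adafactor-style factored scaling; the function name `racs_like_step`, the hyperparameters `beta`/`eps`, and the exact scaling formula are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def racs_like_step(W, grad, row_v, col_v, lr=1e-3, beta=0.9, eps=1e-8):
    """One step of a hypothetical RACS-like update (illustrative, not the
    paper's algorithm): SGD where the gradient of an (m, n) weight matrix
    is rescaled using running per-row and per-column mean-squared-gradient
    statistics. Optimizer state is just two vectors of sizes m and n,
    giving SGD-like memory instead of Adam's full (m, n) second moment."""
    # Exponential moving averages of squared gradients, reduced per row/column.
    row_v[:] = beta * row_v + (1 - beta) * (grad ** 2).mean(axis=1)
    col_v[:] = beta * col_v + (1 - beta) * (grad ** 2).mean(axis=0)
    # Rank-one reconstruction of the full second moment (Adafactor-style
    # normalization by the total row statistic), then inverse-RMS scaling.
    v_hat = np.outer(row_v, col_v) / (row_v.sum() + eps)
    W -= lr * grad / (np.sqrt(v_hat) + eps)
    return W
```

For an (m, n) layer this keeps O(m + n) optimizer state versus Adam's O(mn), which is the kind of memory profile the abstract attributes to RACS.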