FOCUS: First Order Concentrated Updating Scheme

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Gradient noise during large language model (LLM) pretraining impedes optimization stability and slows convergence. Method: the authors propose a first-order optimizer built on the Signum framework, introducing a parameter moving-average attraction mechanism that preserves large effective step sizes while substantially improving convergence stability under high gradient noise. Contribution/Results: the theoretical analysis characterizes the limiting role of gradient noise in LLM pretraining. Empirical evaluation on synthetic loss landscapes confirms superior noise robustness, and on GPT-2 pretraining the method converges faster than Adam while training more stably than Signum, grounding the approach in theory and validating it with practical performance gains.

📝 Abstract
Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley's sharpness, Adam's performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving-averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers a promising solution.
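The abstract describes FOCUS as Signum augmented with an attraction toward an exponential moving average of the parameters. The paper's exact update rule and coefficient values are not given here, so the following is only a minimal sketch of that idea under assumed defaults: a sign-of-momentum step (Signum) plus a small sign-based pull toward a parameter EMA. The names `gamma` (attraction strength) and `beta2` (parameter-EMA decay) are hypothetical, not taken from the paper.

```python
import numpy as np

def focus_step(theta, grad, m, theta_bar,
               lr=1e-3, beta1=0.9, beta2=0.999, gamma=0.1):
    """One sketch-FOCUS update. All hyperparameter defaults are assumptions.

    theta     : current parameters (np.ndarray)
    grad      : gradient at theta
    m         : gradient momentum buffer (Signum-style)
    theta_bar : exponential moving average of past parameters
    """
    # Signum-style momentum on the gradient
    m = beta1 * m + (1 - beta1) * grad
    # EMA of the parameters themselves (the "moving-averaged parameters")
    theta_bar = beta2 * theta_bar + (1 - beta2) * theta
    # Sign update plus a sign-based attraction toward the parameter average;
    # the attraction damps noise-driven wandering without shrinking the step size
    theta = theta - lr * (np.sign(m) + gamma * np.sign(theta - theta_bar))
    return theta, m, theta_bar
```

On a toy quadratic (grad = 2 * theta), iterating this update drives theta toward the minimum while the sign-based step keeps a fixed effective step size, illustrating the abstract's point that FOCUS avoids Adam's drastic step-size reduction under noise.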
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Optimizer Efficiency
Training Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

FOCUS optimizer
noise resistance
LLM pre-training efficiency