Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

📅 2026-05-13
🤖 AI Summary
Can low-rank pre-training generalize as well as full-rank training, and is perplexity alone enough to evaluate it? This work compares five low-rank methods (GaLore, Fira, CoLA, SLTrain, and ReLoRA) against full-rank training at three model scales (60M, 130M, 350M), using 16 geometric and spectral metrics that cover 1-D loss landscape profiles along random and top-K PCA directions, checkpoint interpolation, weight spectrum analysis, and activation similarity. It finds that low-rank and full-rank solutions differ in loss landscape geometry, weight spectra, and activation representations: each low-rank method converges to a basin distinct from the full-rank one and from the other methods, later-layer activations drift away from full-rank training as training progresses, and validation perplexity does not track downstream task performance at every scale. Adding geometric and spectral metrics to perplexity improves the prediction of downstream performance, which challenges perplexity-centric evaluation.
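As a rough illustration of what weight spectrum analysis and activation similarity can look like in practice, here is a minimal NumPy sketch. The summary does not say which exact statistics the paper uses, so stable rank, entropy-based effective rank, and linear CKA are assumptions chosen because they are common choices; the function names are hypothetical.

```python
import numpy as np


def stable_rank(w):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(w, compute_uv=False)  # singular values, descending
    return float(np.sum(s ** 2) / s[0] ** 2)


def effective_rank(w, eps=1e-12):
    """Entropy-based effective rank: exp of the entropy of the
    normalized singular-value distribution (assumed metric)."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / (np.sum(s) + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape
    (n_samples, n_features); feature counts may differ (assumed metric)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full_rank_w = rng.normal(size=(64, 64))
    low_rank_w = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
    print("stable rank:", stable_rank(full_rank_w), stable_rank(low_rank_w))
    print("effective rank:", effective_rank(full_rank_w), effective_rank(low_rank_w))

    acts_a = rng.normal(size=(256, 32))
    acts_b = acts_a + 0.5 * rng.normal(size=(256, 32))
    print("linear CKA:", linear_cka(acts_a, acts_b))
```

In a comparison like the one summarized here, `linear_cka` would be applied layer by layer to activations from a low-rank model and a full-rank model on the same inputs; linear CKA is invariant to orthogonal transforms and isotropic scaling of the features, which is why it is often preferred over a raw distance.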
📝 Abstract
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods: GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets). We compare them against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction of downstream performance.
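The two 1-D geometric probes named in the abstract, a loss slice along a direction and interpolation between checkpoints, can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the paper's procedure: parameters are flattened into a single vector, `toy_loss` is a hypothetical double-well function standing in for a model's validation loss, and the barrier definition (maximum height above the straight line joining the endpoint losses) is one common convention.

```python
import numpy as np


def toy_loss(theta):
    """Hypothetical stand-in for a validation loss: a double well in the
    first coordinate (minima at -1 and +1), quadratic in the rest."""
    return float((theta[0] ** 2 - 1.0) ** 2 + 0.5 * np.sum(theta[1:] ** 2))


def loss_slice(loss_fn, theta, direction, alphas):
    """Loss along theta + alpha * d, where d is `direction` scaled to unit norm."""
    d = direction / np.linalg.norm(direction)
    return np.array([loss_fn(theta + a * d) for a in alphas])


def interpolation_path(loss_fn, theta_a, theta_b, ts):
    """Loss on the segment (1 - t) * theta_a + t * theta_b for t in ts."""
    return np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])


def barrier(ts, path_losses):
    """Largest excess of the path loss over the linear interpolation
    of the two endpoint losses (assumed barrier definition)."""
    baseline = (1.0 - ts) * path_losses[0] + ts * path_losses[-1]
    return float(np.max(path_losses - baseline))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_a = np.array([-1.0, 0.0, 0.0])  # one minimum of toy_loss
    theta_b = np.array([1.0, 0.0, 0.0])   # the other minimum

    # 1-D slice around theta_a along a random direction.
    alphas = np.linspace(-1.0, 1.0, 21)
    print(loss_slice(toy_loss, theta_a, rng.normal(size=3), alphas))

    # Interpolation between the two "checkpoints" and the barrier between them.
    ts = np.linspace(0.0, 1.0, 11)
    path = interpolation_path(toy_loss, theta_a, theta_b, ts)
    print("barrier:", barrier(ts, path))
```

A nonzero barrier between two solutions with similar endpoint loss is the kind of evidence that would indicate they sit in different basins. For the PCA-direction variant, `direction` would be replaced by a principal component of the training trajectory; how those components are computed is not specified in the abstract.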
Problem

Research questions and friction points this paper is trying to address.

low-rank pre-training
generalization
solution quality
validation perplexity
loss landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank pre-training
loss landscape geometry
spectral analysis
activation similarity
model generalization