🤖 AI Summary
To address GPU memory bottlenecks in large-model training and the limitations of existing low-rank adaptation methods—such as LoRA and ReLoRA—which suffer from rigid rank constraints or susceptibility to saddle points, this paper proposes Sparse Spectral Training (SST). SST updates parameters directly in the spectral space: it introduces a sparse singular-vector sampling mechanism, weighted by singular-value importance, that updates all singular values while selectively updating singular vectors, avoiding hard low-rank assumptions. The method is compatible with both Euclidean and hyperbolic neural networks. On OPT-125M, with a rank budget of only 8.3% of the embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%. Across natural language generation, machine translation, node classification, and link prediction tasks, SST consistently outperforms LoRA and ReLoRA, attaining accuracy close to full-rank training while significantly reducing GPU memory consumption.
📝 Abstract
The growing computational demands posed by the increasing number of neural network parameters necessitate low-memory-consumption training approaches. Previous memory reduction techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, suffer from low-rank constraints and saddle point issues, particularly during intensive tasks like pre-training. In this paper, we propose Sparse Spectral Training (SST), an advanced training methodology that updates all singular values and selectively updates singular vectors of network weights, thereby optimizing resource usage while closely approximating full-rank training. SST refines the training process by employing a targeted updating strategy for singular vectors, determined by multinomial sampling weighted by the significance of the singular values, ensuring both high performance and memory reduction. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, including natural language generation, machine translation, node classification, and link prediction, SST demonstrates its capability to outperform existing memory reduction training methods and is comparable with full-rank training in some cases. On OPT-125M, with rank equal to 8.3% of the embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%, a significant reduction of the performance loss seen with prevalent low-rank methods. This approach offers a strong alternative to traditional training techniques, paving the way for more efficient and scalable neural network training solutions.
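The sampling step described above—choosing which singular vectors to update via a multinomial distribution weighted by singular-value magnitude—can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function name `sst_sample_vectors` and the normalization of singular values into sampling probabilities are assumptions for demonstration.

```python
import numpy as np

def sst_sample_vectors(W, r, seed=None):
    """Illustrative sketch of SST-style spectral sampling (not the paper's code).

    Decomposes W = U @ diag(S) @ Vt, then selects r singular-vector pairs
    via multinomial sampling with probabilities proportional to the
    singular values. In SST, all singular values are updated each step,
    while only the sampled singular vectors receive gradient updates.
    """
    rng = np.random.default_rng(seed)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    probs = S / S.sum()  # importance weights from singular-value magnitude
    idx = rng.choice(len(S), size=r, replace=False, p=probs)
    # Sampled left/right singular vectors would be marked trainable;
    # S (all singular values) stays trainable in full.
    return U[:, idx], S, Vt[idx, :], idx

W = np.random.default_rng(0).standard_normal((16, 16))
U_r, S, Vt_r, idx = sst_sample_vectors(W, r=4, seed=0)
print(U_r.shape, Vt_r.shape)  # (16, 4) (4, 16)
```

Sampling without replacement, biased toward large singular values, is what lets the method touch every spectral direction over the course of training (avoiding a fixed low-rank subspace) while keeping the per-step trainable parameter count small.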