LOST: Low-rank and Sparse Pre-training for Large Language Models

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the prohibitive computational and memory overhead of training large language models (LLMs) from scratch, this paper proposes LOSTβ€”a novel method that jointly models low-rank and sparse structures for efficient pretraining. LOST decomposes model weights via singular value decomposition (SVD), using dominant singular vectors as a low-rank basis, while incorporating channel-wise sparse residual terms; both components are optimized end-to-end. This co-design mitigates information loss inherent in conventional low-rank approximations, preserving representational capacity while improving efficiency. LOST enables scalable pretraining across model sizes ranging from 60M to 7B parameters. Empirically, it matches or surpasses full-rank baselines across multi-scale downstream tasks, reduces GPU memory consumption by 38%–52%, and cuts FLOPs by 29%–47%. The implementation is publicly available.
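The decomposition described above can be illustrated with a minimal numpy sketch: split a weight matrix into a rank-r factor pair from its dominant singular directions, then keep a channel-wise sparse residual on the output channels with the largest leftover energy. All function and variable names here are illustrative assumptions, not the authors' actual API, and the real method trains both components end-to-end rather than computing them once.

```python
import numpy as np

def lowrank_sparse_split(W, rank, sparse_channels):
    """Illustrative split of W into low-rank factors plus a
    channel-wise sparse residual (a sketch, not the paper's code)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Low-rank part from the dominant singular directions.
    A = U[:, :rank] * S[:rank]   # shape (m, rank)
    B = Vt[:rank, :]             # shape (rank, n)
    # Residual left over after the low-rank approximation.
    R = W - A @ B
    # Channel-wise sparsity: keep only the output channels (rows)
    # with the largest residual energy, zero out the rest.
    energy = np.linalg.norm(R, axis=1)
    keep = np.argsort(energy)[-sparse_channels:]
    S_sparse = np.zeros_like(R)
    S_sparse[keep] = R[keep]
    return A, B, S_sparse

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B, S_sparse = lowrank_sparse_split(W, rank=8, sparse_channels=4)
# The sparse residual recovers information lost by the
# low-rank approximation, so the joint error is strictly smaller.
err_lowrank = np.linalg.norm(W - A @ B)
err_joint = np.linalg.norm(W - (A @ B + S_sparse))
```

The key point of the co-design is visible in the last two lines: the sparse component captures exactly the residual rows the low-rank factors miss, which is what the summary means by mitigating the information loss of conventional low-rank approximations.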

πŸ“ Abstract
While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose LOw-rank and Sparse pre-Training (LOST) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and memory costs in LLM pre-training
Integrating low-rank and sparse structures effectively
Maintaining performance while lowering training overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates low-rank and sparse structures effectively
Uses SVD for dominant low-rank components
Constructs channel-wise sparse components complementarily
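To make the reported memory savings concrete, the following back-of-envelope sketch counts parameters for a single m x n linear layer under a low-rank plus channel-sparse parameterization versus a dense weight. The rank r and number of sparse channels k are illustrative choices, not values taken from the paper, and real savings also depend on optimizer state and activations.

```python
def param_counts(m, n, r, k):
    """Parameter count of a dense m x n layer vs. a low-rank
    (m x r, r x n) pair plus k nonzero sparse output channels.
    Illustrative accounting only, not the paper's exact budget."""
    full = m * n                  # dense weight matrix
    lowrank = m * r + r * n       # factors A and B
    sparse = k * n                # k channel-wise sparse rows
    return full, lowrank + sparse

# Example: a 4096 x 4096 layer with rank 256 and 128 sparse channels
# keeps under a sixth of the dense parameter count.
full, lost_like = param_counts(4096, 4096, 256, 128)
ratio = lost_like / full
```

Numbers like these show why jointly budgeting rank and sparsity, rather than bolting one onto the other, can cut memory substantially while retaining a residual pathway for expressiveness.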
Authors
Jiaxi Li, University of Surrey
Lu Yin, University of Surrey
Li Shen, Sun Yat-sen University
Jinjin Xu, ByteDance
Liwu Xu, Alibaba Group
Tianjin Huang, Asst. Professor, CS@University of Exeter & Research Fellow, CS@TU/e (LLMs, adversarial examples, stable training, graph neural networks, sparse training)
Wenwu Wang, Professor, University of Surrey, UK (signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion)
Shiwei Liu, University of Oxford
Xilu Wang, University of Surrey