🤖 AI Summary
This work systematically investigates, for the first time, the applicability of ReLoRA to pretraining small language models (SLMs), addressing the open question of whether low-rank adaptation methods can be transferred directly from fine-tuning to pretraining in resource-constrained settings. Building on LoRA-style parameter-efficient techniques, the study conducts ablation experiments and learning-dynamics analysis, covering loss trajectories, perplexity, and grammar-task performance, to evaluate efficacy under limited compute. Results show that ReLoRA consistently underperforms full-parameter pretraining across SLMs, with the degradation worsening as model size increases. The root cause is identified as heightened rank deficiency in smaller models, which prevents low-rank updates from capturing the broad representational shifts required during pretraining. The work thus reveals an intrinsic limitation of low-rank methods in the pretraining phase and offers empirical evidence and a note of caution for the design of efficient pretraining strategies.
📝 Abstract
Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of large language models (LLMs), but their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work presents the first systematic study of ReLoRA in SLMs (11M–66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally underperforms standard training on loss, Paloma perplexity, and BLiMP, with the gap widening for larger models. Further analysis of the models' learning dynamics indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results suggest that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for further research in the low-compute regime.
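The rank constraint at the heart of this result can be illustrated with a minimal, hypothetical sketch of a ReLoRA-style merge-and-restart cycle (the dimensions, init scale, and function name below are illustrative, not taken from the paper): each cycle trains a rank-`r` factorisation, folds it into the base weights, and re-initialises the factors, so any single cycle's update has rank at most `r`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hypothetical layer width and LoRA rank; r << d

W = rng.normal(size=(d, d))          # base weight matrix
A = rng.normal(size=(d, r)) * 0.01   # low-rank factor A (small random init)
B = np.zeros((r, d))                 # factor B initialised to zero, as in LoRA

def relora_restart(W, A, B, rng):
    """Illustrative ReLoRA-style restart: merge A @ B into W,
    then re-initialise A and zero out B for the next cycle."""
    W = W + A @ B                    # fold the rank-<=r update into the base
    A = rng.normal(size=A.shape) * 0.01
    B = np.zeros_like(B)
    return W, A, B

# Pretend one training cycle has driven B to some nonzero values.
B = rng.normal(size=B.shape)
delta = A @ B                        # the update about to be merged
W, A, B = relora_restart(W, A, B, rng)

# Each cycle's update can never exceed rank r, regardless of training.
print(np.linalg.matrix_rank(delta) <= r)
```

After `k` restarts the cumulative update has rank at most `k * r`; the paper's finding is that in small models even this accumulated budget fails to match the broad weight changes that full-rank pretraining makes.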