Curriculum-Guided Layer Scaling for Language Model Pretraining

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretraining large language models (LLMs) incurs prohibitive computational costs due to fixed, monolithic architectures.

Method: This paper proposes a curriculum-guided layer-wise scaling framework that dynamically increases model depth during training while concurrently escalating data difficulty, enabling co-evolution of model capacity and data complexity. It brings curriculum learning principles from cognitive development into model scaling, establishing the first joint progressive mechanism for data difficulty and model depth. Key components include DistilBERT-based text difficulty classification, progressive layer stacking, and a multi-stage curriculum ranging from synthetic short stories to diverse web-scale data.

Contribution/Results: Experiments demonstrate superior PIQA and ARC performance over baselines at 100M parameters; at 1.2B parameters, zero-shot generalization improves significantly, especially on knowledge-intensive and reasoning tasks, validating the efficacy of structured, co-adaptive scaling.

📝 Abstract
As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.
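The abstract describes progressive layer stacking as "gradually adding layers during training." The paper does not spell out here how newly added layers are initialized; a common choice in progressive-stacking work is to duplicate already-trained layers rather than initialize randomly. The sketch below illustrates that idea with a plain-Python stand-in for transformer blocks; `Block` and `grow_by_stacking` are hypothetical names, and the copy-based initialization is an assumption of this sketch, not a confirmed detail of CGLS.

```python
import copy

class Block:
    """Stand-in for a transformer layer; real weights would live here."""
    def __init__(self, weights):
        self.weights = weights

def grow_by_stacking(layers, target_depth):
    """Grow a layer stack to target_depth by duplicating existing layers.

    Copying trained layers preserves the function the shallow model has
    already learned, so training can continue without a loss spike.
    Whether CGLS uses exactly this initialization is an assumption here.
    """
    assert target_depth >= len(layers)
    grown = list(layers)
    i = 0
    while len(grown) < target_depth:
        # cycle over the existing layers and append deep copies
        grown.append(copy.deepcopy(layers[i % len(layers)]))
        i += 1
    return grown

# Start shallow, then double the depth partway through training.
model = [Block(w) for w in range(4)]
model = grow_by_stacking(model, 8)
print(len(model))  # 8
```

In a real framework the same logic would operate on a `torch.nn.ModuleList` of transformer blocks, with the optimizer state re-registered after each growth step.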
Problem

Research questions and friction points this paper is trying to address.

Improving efficiency in pretraining large language models
Synchronizing data difficulty with model growth
Enhancing generalization and zero-shot performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradually adds layers during training
Synchronizes data difficulty with model growth
Uses curriculum-based progressive layer stacking
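The synchronization the bullets describe can be pictured as a schedule that maps a training step to both a model depth and a data-difficulty stage. The sketch below is a minimal illustration under stated assumptions: the equal-length stage split, the specific depths (6/12/24), and the function name `cgls_schedule` are all hypothetical, though the stage progression (synthetic short stories, then general web text, then technical content) follows the curricula described in the abstract.

```python
def cgls_schedule(step, total_steps, stages):
    """Map a training step to a (num_layers, difficulty_stage) pair.

    `stages` lists (num_layers, stage_name) entries applied over equal
    fractions of training. The equal split is an illustrative assumption;
    the paper's actual stage boundaries may differ.
    """
    frac = min(step / total_steps, 1.0 - 1e-9)  # clamp so the last stage holds
    idx = int(frac * len(stages))
    return stages[idx]

# Illustrative curriculum: shallow model on easy data, full depth on hard data.
STAGES = [
    (6,  "synthetic short stories"),
    (12, "general web text"),
    (24, "technical / specialized"),
]

depth, data = cgls_schedule(step=0, total_steps=30000, stages=STAGES)
print(depth, data)  # 6 synthetic short stories
```

Each time the schedule's depth increases, the training loop would grow the model (e.g. by stacking layers) and switch the data loader to the corresponding difficulty stratum, as produced by the paper's DistilBERT-based classifier.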