SkyLadder: Better and Faster Pretraining via Context Window Scheduling

📅 2025-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Balancing long-context capability and training efficiency remains challenging in large language model (LLM) pretraining. Method: This paper proposes a short-to-long dynamic context window scheduling strategy. We first establish, theoretically and empirically, that under a fixed token budget, pretraining with shorter contexts is more efficient. Building on this insight, we design a progressive window expansion mechanism that significantly enhances long-text modeling capability without degrading performance on standard benchmarks. The approach is fully compatible with mainstream Transformer architectures and data loading pipelines, requiring no modifications to the model structure or training objective. Results: Trained on 100B tokens, our 1B and 3B models achieve gains of up to 3.7% on common benchmarks and up to a 22% training speedup. Our core contributions are the identification of the trade-off between context length and pretraining efficiency, and an efficient, general-purpose, plug-and-play dynamic window scheduling framework.
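
The scheduling idea above can be made concrete with a small sketch. The snippet below is a minimal, hypothetical illustration (not the authors' released implementation) of a short-to-long schedule, assuming a simple linear ramp from a short initial window to the final target window over a fixed fraction of training; the names context_window_at and pack_into_windows and the default values are illustrative assumptions.

```python
def context_window_at(step: int,
                      total_steps: int,
                      initial_window: int = 512,
                      final_window: int = 8192,
                      ramp_fraction: float = 0.8) -> int:
    """Return the context window length to use at a given training step.

    Hypothetical linear short-to-long ramp; the paper's actual schedule
    may differ in shape and hyperparameters.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return final_window
    # Linearly interpolate between the short and long windows.
    frac = step / ramp_steps
    window = initial_window + frac * (final_window - initial_window)
    # Round down to a multiple of the initial window so packed chunks align.
    return max(initial_window, int(window) // initial_window * initial_window)


def pack_into_windows(token_ids, window):
    """Split a pre-tokenized stream into chunks of the current window size."""
    return [token_ids[i:i + window]
            for i in range(0, len(token_ids) - window + 1, window)]


if __name__ == "__main__":
    total = 100_000
    for step in (0, 25_000, 50_000, 80_000, 99_999):
        print(step, context_window_at(step, total))
```

Under this assumed schedule, early steps see many short sequences and later steps see the full target window, which is the qualitative behavior the summary describes.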

📝 Abstract
Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
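
The abstract emphasizes that the transition works within standard pretraining setups. One plausible way to shrink the effective context window inside fixed-length packed batches, without touching the data pipeline or training objective, is to restrict attention to blocks of the currently scheduled window size. The sketch below is an assumption-level illustration (intra_window_attention_mask is a hypothetical helper), not necessarily the paper's actual mechanism.

```python
import torch


def intra_window_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask (True = may attend) restricting causal attention to
    consecutive blocks of `window` tokens within a packed sequence.

    The packed batch of length `seq_len` stays unchanged; only the mask
    changes as the scheduled window grows.
    """
    pos = torch.arange(seq_len)
    # Tokens attend only within their own block of size `window`.
    same_block = (pos[:, None] // window) == (pos[None, :] // window)
    # Standard causal constraint within each block.
    causal = pos[:, None] >= pos[None, :]
    return same_block & causal


if __name__ == "__main__":
    # Positions 0-3 attend only among themselves, as do positions 4-7.
    print(intra_window_attention_mask(8, 4).int())
```

For example, with a packed length of 8 and a scheduled window of 4, the mask is block-diagonal with two 4x4 causal blocks; raising the window to 8 later in training recovers full causal attention over the packed sequence.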
Problem

Research questions and friction points this paper is trying to address.

Optimize context window scheduling for LLM pretraining efficiency.
Balance long-context capability with pretraining performance.
Achieve faster training speeds without sacrificing benchmark performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Short-to-long context window transition strategy
Improved pretraining efficiency and benchmark performance
Faster training speeds with consistent performance gains
Authors
Tongyao Zhu, National University of Singapore (Natural Language Processing)
Qian Liu, Sea AI Lab
Haonan Wang, National University of Singapore
Shiqi Chen, City University of Hong Kong
Xiangming Gu, National University of Singapore (Machine Learning, Large Language Models, Generative Models)
Tianyu Pang, Sea AI Lab
Min-Yen Kan, National University of Singapore