FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) pretraining lacks quantitative principles for data organization, relying instead on empirical heuristics that limit performance gains. Method: This paper proposes a theoretically grounded, four-stage progressive pretraining paradigm guided by a dual-dimensional, quadrant-based data partitioning principle leveraging perplexity (PPL) and perplexity difference (PD), enabling fully quantified coordination among data evaluation, dynamic partitioning, and training scheduling. Contribution/Results: The approach removes reliance on experience-driven multi-stage strategies and produces four distinct, significant drops in training loss, one per stage. Evaluated on the MMLU and CMMLU benchmarks, it achieves an average improvement of 16.8%, substantially outperforming both random sampling and state-of-the-art multi-stage baselines.

📝 Abstract
Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining Strategy (FRAMES), guided by the established principle of organizing the pretraining process into four stages so that the loss drops significantly four times. This principle is grounded in two key findings: first, training on high Perplexity (PPL) data followed by low PPL data, and second, training on low PPL difference (PD) data followed by high PD data, each causing the loss to drop significantly twice and enhancing performance. By partitioning data into four quadrants and strategically organizing them, FRAMES achieves a remarkable 16.8% average improvement over random sampling across MMLU and CMMLU, effectively boosting LLM performance.
Problem

Research questions and friction points this paper is trying to address.

Enhance large language model performance
Optimize multi-stage pretraining strategy
Quantify data partitioning for pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Four-quadrant multi-stage pretraining
Data partitioning by Perplexity levels
Strategic organization for loss reduction
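As a rough illustration of the quadrant idea above, the sketch below partitions scored samples into four PPL/PD quadrants. The `Sample` fields, the median-based thresholds, and the quadrant labels are assumptions for illustration only, not the paper's implementation; the actual stage schedule would then order these quadrants according to the high-to-low PPL and low-to-high PD principles described in the abstract.

```python
# Hypothetical sketch of four-quadrant data partitioning by PPL and PD.
# Assumes each sample has already been scored; PD is taken here to be the
# gap between a weaker and a stronger reference model's perplexity, which
# is an assumption, not necessarily the paper's exact definition.
from dataclasses import dataclass
from statistics import median


@dataclass
class Sample:
    text: str
    ppl: float  # perplexity under a reference model
    pd: float   # perplexity difference (assumed: weak-model PPL - strong-model PPL)


def partition_four_quadrants(samples):
    """Split samples into four quadrants using median PPL and PD thresholds."""
    ppl_t = median(s.ppl for s in samples)
    pd_t = median(s.pd for s in samples)
    quadrants = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for s in samples:
        hi_ppl = s.ppl >= ppl_t
        hi_pd = s.pd >= pd_t
        if hi_ppl and not hi_pd:
            quadrants["Q1"].append(s)  # high PPL, low PD
        elif hi_ppl and hi_pd:
            quadrants["Q2"].append(s)  # high PPL, high PD
        elif not hi_ppl and not hi_pd:
            quadrants["Q3"].append(s)  # low PPL, low PD
        else:
            quadrants["Q4"].append(s)  # low PPL, high PD
    return quadrants


if __name__ == "__main__":
    data = [
        Sample("a", ppl=4.0, pd=1.0),
        Sample("b", ppl=4.0, pd=4.0),
        Sample("c", ppl=1.0, pd=1.0),
        Sample("d", ppl=1.0, pd=4.0),
    ]
    q = partition_four_quadrants(data)
    print({k: len(v) for k, v in q.items()})  # each quadrant gets one sample
```

In practice the thresholds need not be medians; any cutoffs that yield a balanced, quantitatively justified split of the corpus would fit the strategy's spirit.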