🤖 AI Summary
This work investigates optimization strategies for language model pretraining under data-scarce conditions, focusing on how text simplification and curriculum learning affect representation quality. We propose using LLM-generated text simplifications for data augmentation and design four complexity-aware data scheduling strategies, including low-to-high and interleaved ordering. Experiments demonstrate that incorporating simplified texts significantly improves both fine-tuning and zero-shot performance. Smaller models benefit more from low-to-high complexity ordering, whereas larger models achieve greater gains with interleaved scheduling. Departing from the conventional practice of simply repeating the pretraining data, this study provides the first systematic empirical validation that text simplification is an effective data augmentation technique and that complexity-aware scheduling further enhances pretraining efficiency. These findings offer new insights and empirical support for resource-efficient pretraining in low-data regimes.
📝 Abstract
Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask two questions: (1) Does simplifying texts enhance representation quality more than reusing the original data? (2) Does ordering data by text complexity yield better representations? To answer them, we build on a pair of parallel corpora in which human-written paragraphs are aligned with LLM-simplified variants, and we test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample-efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity ordering, while larger models perform better with interleaved ordering.
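To make the four schedules concrete, here is a minimal Python sketch of how such orderings could be built from a parallel corpus of (original, simplified) paragraph pairs. It is illustrative only, not the paper's code: the `complexity` scorer (e.g., a readability metric) and the paragraph-level granularity of the interleaved schedule are assumptions.

```python
# Illustrative sketch (not the paper's implementation) of the four data schedules
# applied to a parallel corpus of (original, simplified) paragraph pairs.
from typing import Callable, List, Tuple


def build_schedule(
    pairs: List[Tuple[str, str]],        # (original, simplified) paragraph pairs
    schedule: str,                        # "repeated" | "low_to_high" | "high_to_low" | "interleaved"
    complexity: Callable[[str], float],   # assumed complexity scorer, e.g. a readability metric
) -> List[str]:
    originals = [orig for orig, _ in pairs]
    simplified = [simp for _, simp in pairs]

    if schedule == "repeated":
        # Baseline: reuse the original paragraphs instead of adding simplified variants.
        return originals + originals

    texts = originals + simplified
    if schedule == "low_to_high":
        # Curriculum from simpler to more complex texts.
        return sorted(texts, key=complexity)
    if schedule == "high_to_low":
        # Reverse curriculum: most complex texts first.
        return sorted(texts, key=complexity, reverse=True)
    if schedule == "interleaved":
        # Alternate simplified and original versions of each paragraph.
        return [text for pair in zip(simplified, originals) for text in pair]
    raise ValueError(f"unknown schedule: {schedule}")
```

In this sketch, the repeated-exposure baseline sees the originals twice so that the number of training texts matches the augmented schedules; whether the paper matches data budgets exactly this way is an assumption.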