Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates optimization strategies for language model pretraining under data-scarce conditions, focusing on how text simplification and curriculum learning affect representation quality. We propose LLM-generated text simplification as a data augmentation method and design four complexity-aware data scheduling strategies, including low-to-high and interleaved ordering. Experiments demonstrate that incorporating simplified texts significantly improves both fine-tuning and zero-shot performance. Smaller models benefit more from low-to-high complexity ordering, whereas larger models achieve greater gains with interleaved scheduling. This study moves beyond the conventional repeated-exposure pretraining paradigm and provides the first systematic empirical validation that text simplification serves as an effective data augmentation technique and that complexity-aware scheduling synergistically enhances pretraining efficiency. Our findings offer novel insights and empirical support for resource-efficient pretraining in low-data regimes.

📝 Abstract
Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora in which human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample-efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity ordering, while larger models perform better with interleaved ordering.
Problem

Research questions and friction points this paper is trying to address.

Optimizing pretraining strategies for data-constrained language models
Evaluating text simplification as data augmentation for pretraining
Investigating curriculum learning via text-complexity ordering effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum learning with text-complexity ordering
Data augmentation via text simplification
Comparing four data schedules for pretraining efficiency
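The four data schedules compared in the paper can be sketched as a simple ordering function over parallel original and simplified samples. This is a minimal illustration only; the function name `make_schedule` and the list-based representation are assumptions for exposition, not the authors' implementation.

```python
def make_schedule(original, simplified, strategy):
    """Order pretraining samples under one of the four schedules
    studied in the paper.

    original:   high-complexity (human-written) samples
    simplified: aligned low-complexity (LLM-simplified) variants
    """
    if strategy == "repeated":        # baseline: reuse originals twice
        return original + original
    if strategy == "low_to_high":     # curriculum: simple texts first
        return simplified + original
    if strategy == "high_to_low":     # reverse curriculum: complex first
        return original + simplified
    if strategy == "interleaved":     # alternate simplified/original pairs
        return [x for pair in zip(simplified, original) for x in pair]
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with originals `["O1", "O2"]` and simplified variants `["S1", "S2"]`, the interleaved schedule yields `["S1", "O1", "S2", "O2"]`, while low-to-high yields `["S1", "S2", "O1", "O2"]`.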
Matthew Theodore Roque
Samsung R&D Institute Philippines
Dan John Velasco
Samsung Research Philippines
Natural Language Processing · Deep Learning