🤖 AI Summary
This study investigates how the benefit of synthetic rewriting depends on source data quality in continued pretraining for Portuguese, and whether rewriting can substitute for high-quality data curation. Using the ClassiCC-PT corpus, which is annotated with STEM and educational quality scores, the authors construct two 10B-token subsets, one of high-quality and one of low-quality text. They rewrite each subset into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition (about 80B in total), then train 1.1B and 7B English-centric base models and evaluate them on the 44-task PoETa V2 benchmark. The work provides systematic evidence in a non-English setting that synthetic rewriting acts as an amplifier of data quality rather than a replacement for it, and that the effect is scale-dependent: at 7B, rewriting high-quality data yields +3.4 NPM over the same data unmodified versus only +0.5 NPM for low-quality data, while at 1.1B the interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data.
📝 Abstract
Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
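The token budget and the headline quality-by-rewriting interaction can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the abstract. The variable names are illustrative, and it assumes each rewrite is roughly as long as its source document, which the abstract's "approximately 40B tokens" suggests but does not state.

```python
# Figures taken from the abstract; all names here are illustrative only.
SUBSET_TOKENS_B = 10   # tokens per quality subset, in billions
NUM_STYLES = 4         # rewriting styles applied to each subset
NUM_CONDITIONS = 2     # high-quality and low-quality source data

# Assumption: a rewritten document is about as long as its source,
# so each style contributes roughly one subset's worth of tokens.
synthetic_per_condition_b = SUBSET_TOKENS_B * NUM_STYLES
synthetic_total_b = synthetic_per_condition_b * NUM_CONDITIONS

# Reported 7B-scale gains of rewritten over unmodified data, in NPM points.
gain_high_quality = 3.4
gain_low_quality = 0.5

# Difference between the two gains: how much more rewriting helps
# when the source data is high quality rather than low quality.
interaction = round(gain_high_quality - gain_low_quality, 1)

print(f"synthetic tokens per condition: ~{synthetic_per_condition_b}B")
print(f"synthetic tokens in total:      ~{synthetic_total_b}B")
print(f"extra gain from high-quality source at 7B: {interaction} NPM")
```

Under these assumptions the budget comes to about 40B synthetic tokens per condition and 80B overall, and rewriting helps by 2.9 NPM more on high-quality than on low-quality source data at the 7B scale.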