Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

📅 2026-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how the benefit of synthetic rewriting depends on source data quality in continued pretraining for Portuguese, and whether rewriting can substitute for high-quality data curation. Using the ClassiCC-PT corpus, which is annotated with STEM and educational quality scores, the authors construct two 10B-token subsets of high- and low-quality text. They rewrite each subset into four styles with a 7B instruction-tuned model, producing approximately 40B synthetic tokens per condition (about 80B in total), then continue pretraining English-centric 1.1B and 7B base models on each condition and evaluate them on the 44-task PoETa V2 benchmark. The results indicate that synthetic rewriting acts as a quality multiplier rather than a replacement for curation, and that the effect is scale-dependent: at 7B, rewriting high-quality data gains +3.4 NPM over the same data unmodified, versus only +0.5 NPM for low-quality data, while at 1.1B the interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data.

📝 Abstract

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
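The token budget and the quality-by-rewriting interaction reported in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, it assumes each rewrite is roughly as long as its source document (the abstract says "approximately 40B"), and it uses only the 7B-scale gains stated above.

```python
# Illustrative arithmetic based on figures in the abstract.
# All names here are hypothetical; nothing below is the authors' code.

SOURCE_TOKENS_B = 10   # size of each quality subset, in billions of tokens
NUM_STYLES = 4         # rewriting styles applied to each subset

# Reported NPM gain of rewritten data over the same data unmodified (7B scale).
GAIN_7B = {"high": 3.4, "low": 0.5}


def synthetic_tokens_per_condition(source_b: float, styles: int) -> float:
    """Approximate synthetic tokens (billions), assuming each rewrite
    is about as long as the source document."""
    return source_b * styles


def quality_interaction(gains: dict) -> float:
    """Difference between the rewriting gain on high-quality data and
    the rewriting gain on low-quality data (a difference-in-differences)."""
    return round(gains["high"] - gains["low"], 1)


per_condition = synthetic_tokens_per_condition(SOURCE_TOKENS_B, NUM_STYLES)
total = per_condition * len(GAIN_7B)  # two quality conditions

print(f"synthetic tokens per condition: ~{per_condition}B")
print(f"synthetic tokens in total:      ~{total}B")
print(f"7B interaction (high - low):    {quality_interaction(GAIN_7B)} NPM")
```

A positive interaction means rewriting helps high-quality sources more than low-quality ones, which is the sense in which the abstract calls rewriting a quality multiplier.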

Problem

Research questions and friction points this paper is trying to address.

synthetic rewriting

data quality

language model pretraining

Portuguese

scale dependence

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic rewriting

data quality

continued pretraining