🤖 AI Summary
High-quality training corpora for low-resource languages, exemplified here by Portuguese, are scarce, limiting the cross-lingual transfer performance of large language models. Method: This paper proposes a multi-stage, industrial-scale pipeline for constructing a Portuguese corpus from Common Crawl snapshots. It combines language-specific filtering modules (language identification, exact deduplication, fine-grained quality scoring, STEM-domain classification, and toxicity detection) to produce a high-quality 120B-token Portuguese corpus. Unlike generic web-crawl pipelines that rely on coarse heuristics, this approach tightly couples domain-aware classification with adaptation to the target language. Contribution/Results: Continual pretraining on the curated corpus yields substantial gains across multiple downstream tasks compared with baseline models, empirically validating that language-specific, domain-aware data curation is critical for improving cross-lingual transferability.
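The staged filtering described above can be sketched as a simple sequential pipeline. This is a minimal illustration, not the paper's implementation: every classifier below is a stub (keyword matching for language ID, a length heuristic for quality, an always-clean toxicity check), standing in for the trained models the summary refers to, whose details are not given here.

```python
import hashlib

def sha256(text: str) -> str:
    """Content hash used for exact deduplication."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_portuguese(text: str) -> bool:
    # Stub: whole-word keyword check standing in for a language-ID model.
    padded = f" {text.lower()} "
    return any(f" {w} " in padded for w in ("de", "que", "não", "para"))

def quality_score(text: str) -> float:
    # Stub: word-count heuristic standing in for a learned quality scorer.
    return min(len(text.split()) / 50.0, 1.0)

def is_toxic(text: str) -> bool:
    # Stub: a real pipeline would run a trained toxicity classifier here.
    return False

def curate(docs, min_quality=0.05):
    """Apply the stages in order: language ID, exact dedup,
    quality scoring, toxicity filtering."""
    seen, corpus = set(), []
    for doc in docs:
        if not is_portuguese(doc):            # 1. language identification
            continue
        h = sha256(doc)                       # 2. exact deduplication
        if h in seen:
            continue
        seen.add(h)
        if quality_score(doc) < min_quality:  # 3. fine-grained quality score
            continue
        if is_toxic(doc):                     # 4. toxicity detection
            continue
        corpus.append(doc)
    return corpus
```

In the real system, a STEM-domain classifier would additionally tag or upweight documents rather than filter them; that stage is omitted here for brevity.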
📝 Abstract
The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs and apply them to build a new 120B-token corpus in Portuguese that achieves results competitive with an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education; science, technology, engineering, and mathematics (STEM); and toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.