Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality training corpora for non-English languages such as Portuguese are scarce, limiting the cross-lingual transfer performance of large language models. Method: the paper proposes a multi-stage, industrial-scale pipeline for constructing a Portuguese corpus from Common Crawl snapshots. The pipeline chains language-specific filtering modules (language identification, exact deduplication, fine-grained quality scoring, STEM-domain classification, and toxicity detection) to produce a 120B-token high-quality Portuguese corpus. Unlike generic web-crawled pipelines that rely on coarse heuristics, this approach tightly couples domain-aware classification with target-language adaptation. Contribution/Results: continual pretraining on the curated corpus yields substantial gains across multiple downstream tasks compared to baseline models, empirically validating that language-specific, domain-aware data curation improves cross-lingual transferability.
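The staged filtering described above can be sketched as a simple sequential pipeline. This is an illustrative sketch, not the paper's implementation: the classifier functions are passed in as callables, and the 0.5 quality and toxicity thresholds are assumptions.

```python
# Hypothetical sketch of the multi-stage filtering pipeline summarized above.
# Stage order follows the summary; scorers and thresholds are assumptions.
import hashlib

def exact_dedup(docs):
    """Drop documents whose full text has been seen before (exact dedup)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def build_corpus(docs, lang_id, quality, stem, toxicity):
    """Apply the stages in order: language ID -> dedup -> quality -> STEM tag -> toxicity."""
    docs = [d for d in docs if lang_id(d["text"]) == "pt"]    # keep Portuguese only
    docs = exact_dedup(docs)
    docs = [d for d in docs if quality(d["text"]) >= 0.5]     # assumed quality cutoff
    for d in docs:
        d["stem"] = stem(d["text"])                           # domain label kept as metadata
    return [d for d in docs if toxicity(d["text"]) < 0.5]     # assumed toxicity cutoff
```

In practice each callable would be a trained model (e.g. a fastText-style language identifier and learned quality/toxicity classifiers), and deduplication would run on hashes rather than full documents at scale.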

📝 Abstract
The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves results competitive with an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.
Problem

Research questions and friction points this paper is trying to address.

Building high-quality Portuguese datasets for LLMs
Studying data selection effects on multilingual model performance
Developing scalable web corpus construction methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable web corpus construction methods
Language-specific filtering pipelines for quality
Continual pretraining transitioning English to Portuguese
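One common ingredient of continual pretraining from English to a target language is mixing in a small "replay" share of source-language data to limit catastrophic forgetting. The sketch below illustrates that data-mixing step only; the 90/10 default ratio is an assumption, not the paper's reported mixture.

```python
import random

def mix_corpora(target_docs, replay_docs, target_ratio=0.9, seed=0):
    """Mix target-language documents with a small replay share of
    source-language data for continual pretraining. Retaining some
    original-language data is common practice against catastrophic
    forgetting; the 90/10 default is an assumption, not the paper's setup.
    """
    # How many replay documents to add so the target share is ~target_ratio.
    n_replay = round(len(target_docs) * (1 - target_ratio) / target_ratio)
    mixed = list(target_docs) + list(replay_docs)[:n_replay]
    random.Random(seed).shuffle(mixed)  # fixed seed for a reproducible order
    return mixed
```

With 9 Portuguese documents and a 0.9 target ratio, one English replay document is added, giving a 10-document mixed stream.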