🤖 AI Summary
Existing Thai pretraining corpora suffer from inadequate script handling, poor cultural adaptation, and opaque construction methodologies. Method: We propose the first fully reproducible Thai corpus construction framework and publicly release Mangosteen, a high-quality Thai corpus of 47 billion tokens. Our approach adapts the Dolma pipeline with Thai-specific language identification, hybrid rule- and model-based content filtering (including detection of culturally sensitive material such as gambling content), and integration of heterogeneous non-web sources (Wikipedia, the Royal Gazette, OCR-extracted books, and CC-licensed YouTube subtitles). The pipeline reduces Common Crawl from 202 million documents to 25 million. Contribution/Results: In GPT-2 ablations, Mangosteen yields an 8-point improvement in SEA-HELM NLG score. An 8B-parameter model continually pre-trained on Mangosteen outperforms SEA-LION-v3 and Llama-3.1 by roughly 4 points on Thai benchmarks, empirically validating the impact of high-quality, culturally grounded data on low-resource language modeling.
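The Thai-specific language identification mentioned above can be pictured as a simple script-ratio rule. The sketch below is illustrative only (the function name and 0.5 threshold are hypothetical, not the paper's actual filter): it keeps documents whose characters fall mostly in the Thai Unicode block (U+0E00–U+0E7F).

```python
# Hypothetical sketch of a rule-based Thai language ID filter.
# The threshold and function names are illustrative assumptions,
# not the Mangosteen pipeline's actual implementation.

def thai_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Thai block (U+0E00-U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    thai = sum(1 for c in chars if "\u0e00" <= c <= "\u0e7f")
    return thai / len(chars)

def is_probably_thai(text: str, threshold: float = 0.5) -> bool:
    """Keep a document if at least `threshold` of its characters are Thai."""
    return thai_char_ratio(text) >= threshold

print(is_probably_thai("สวัสดีครับ ยินดีต้อนรับ"))            # True (all Thai)
print(is_probably_thai("Hello world, this is English."))  # False
```

A ratio rule like this is cheap enough to run over hundreds of millions of Common Crawl documents before the more expensive model-based filters.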
📝 Abstract
Pre-training data shapes a language model's quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47-billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims Common Crawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.