Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

📅 2025-09-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current open large language models suffer from two systemic deficiencies: inadequate data compliance and severe underrepresentation of low-resource languages. To address these, the authors introduce Apertus, a fully open-source multilingual LLM series covering more than 1,800 languages. Pretraining uses the Goldfish objective to suppress verbatim memorization, combined with retroactive robots.txt adherence, filtering of non-permissive, toxic, and personally identifiable content, enabling end-to-end reproducible and transparent data governance. Trained on a 15-trillion-token multilingual corpus, the series comprises 8B- and 70B-parameter models. On multilingual benchmarks, both variants approach state-of-the-art performance among fully open models, rivalling or surpassing open-weight counterparts. All training code, data-preparation scripts, checkpoints, and evaluation tooling are released under permissive open-source licenses.
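The retroactive robots.txt adherence described above can be sketched with Python's standard `urllib.robotparser`: already-crawled URLs are re-checked against a site's published exclusion rules and dropped if disallowed. The rules and URLs below are illustrative only, not the actual Apertus pipeline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for one site; a real pipeline would fetch
# each site's current robots.txt and filter its crawled pages accordingly.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

def is_compliant(url: str, user_agent: str = "*") -> bool:
    """Return True if the URL may be kept under the site's robots.txt rules."""
    return parser.can_fetch(user_agent, url)
```

Applying `is_compliant` as a filter over a crawled corpus retroactively respects content-owner exclusions even for pages collected before the rules were checked.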

📝 Abstract
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
Problem

Research questions and friction points this paper is trying to address.

Addressing data compliance and content filtering in open LLMs
Expanding multilingual representation with broad language coverage
Mitigating memorization risks while maintaining task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Openly available compliant data training
Goldfish objective suppressing verbatim recall
Multilingual training on 1800+ languages
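The Goldfish objective listed above suppresses verbatim recall by excluding a deterministic pseudorandom subset of token positions from the next-token loss, so the model is never supervised on every token of a training sequence. A minimal sketch, assuming a drop rate of 1/k with the mask derived from a hash of the local context; the function names and parameters here are illustrative, not the paper's implementation:

```python
import hashlib

def goldfish_mask(tokens, k=4, h=13):
    """Deterministic mask: drop a position (0.0) when a hash of its
    h-token context falls in a 1/k bucket; keep it (1.0) otherwise."""
    mask = []
    for i in range(len(tokens)):
        ctx = tokens[max(0, i - h):i + 1]
        digest = hashlib.sha256(str(ctx).encode("utf-8")).digest()
        drop = int.from_bytes(digest[:8], "big") % k == 0
        mask.append(0.0 if drop else 1.0)
    return mask

def goldfish_loss(per_token_losses, tokens, k=4, h=13):
    """Average cross-entropy over kept positions only; dropped positions
    contribute nothing, so their exact tokens are never fully supervised."""
    mask = goldfish_mask(tokens, k=k, h=h)
    kept = [loss * m for loss, m in zip(per_token_losses, mask)]
    denom = sum(mask) or 1.0  # guard against an all-dropped sequence
    return sum(kept) / denom
```

Because the mask is a deterministic function of the context rather than fresh randomness, repeated occurrences of the same passage drop the same positions, which is what prevents the model from eventually seeing every token of a duplicated document.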