MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the legal risks and performance trade-offs arising from indiscriminate web crawling in LLM pretraining, this paper proposes a license-first data curation paradigm for constructing an open, traceable, and legally compliant large-scale pretraining corpus. Methodologically, the authors design a risk-mitigating data provenance framework integrating public-domain, permissively licensed, and low-risk sources; implement a multi-stage pipeline featuring license-aware filtering, dual safety/quality screening, and domain-aware mixing; and incorporate instruction tuning, reasoning-oriented data augmentation, and controllable synthetic data generation. Evaluated on models ranging from 130M to 1.7B parameters trained on 50B–300B tokens, the corpus surpasses FineWeb-Edu and approaches DCLM, with notable gains in mathematical reasoning and code generation. This work establishes a new benchmark for lawful, transparent, and reproducible LLM training.

📝 Abstract
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning, and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform models trained on other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical, legally risk-mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
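The license-aware filtering stage described above can be pictured as an allowlist pass over documents with normalized license tags. The sketch below is a minimal illustration, not the paper's actual pipeline code; the license taxonomy, tag names, and `Document` fields are all hypothetical simplifications (the real framework also covers government works, EU TDM-eligible sources, and richer provenance records).

```python
from dataclasses import dataclass

# Illustrative allowlist of permissive license tags; the paper's real
# taxonomy is broader and includes justified low-risk categories.
PERMISSIVE = {"public-domain", "cc-by", "cc-by-sa", "apache-2.0", "mit"}


@dataclass
class Document:
    text: str
    license: str  # normalized license tag, e.g. "cc-by"
    source: str   # provenance record, e.g. repository or crawl name


def license_filter(docs, allowlist=PERMISSIVE):
    """Keep only documents whose normalized license tag is allowlisted."""
    for doc in docs:
        if doc.license.lower() in allowlist:
            yield doc


docs = [
    Document("Reusable text.", "CC-BY", "example-corpus"),
    Document("All rights reserved.", "proprietary", "example-crawl"),
]
kept = list(license_filter(docs))
# kept contains only the CC-BY document
```

In a real pipeline this pass would be followed by the safety and quality screening stages the abstract describes, with each document carrying its provenance record through every stage.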
Problem

Research questions and friction points this paper is trying to address.

Creating a legally safe pretraining dataset from permissive sources
Developing a transparent pipeline for license-aware data filtering
Achieving competitive model performance while reducing legal risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

Permissive-first data sourcing minimizes legal risks
Multi-stage pipeline filters licenses and ensures quality
Targeted instruction and reasoning data enhance model performance
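The domain-aware mixing mentioned above amounts to sampling documents from per-domain buckets according to target mixture weights. A minimal sketch follows; the domain names, weights, and function signature are illustrative assumptions, not the paper's actual mixing recipe.

```python
import random


def mix_domains(buckets, weights, n_samples, seed=0):
    """Draw n_samples documents, choosing a domain for each draw
    in proportion to its mixture weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    domains = list(buckets)
    probs = [weights[d] for d in domains]
    out = []
    for _ in range(n_samples):
        d = rng.choices(domains, weights=probs, k=1)[0]
        out.append(rng.choice(buckets[d]))
    return out


# Hypothetical buckets and weights; a real recipe would tune these
# against downstream benchmarks (e.g. upweighting math/code).
buckets = {
    "web": ["web doc a", "web doc b"],
    "code": ["code doc a", "code doc b"],
    "math": ["math doc a", "math doc b"],
}
weights = {"web": 0.6, "code": 0.2, "math": 0.2}
sample = mix_domains(buckets, weights, n_samples=100)
```

Because each draw is independent, the expected share of each domain in the output matches its weight; a production mixer would typically sample without replacement and respect per-source token budgets.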