🤖 AI Summary
There is a lack of open, reproducible benchmarks for systematically evaluating language model training methods across scales and datasets. Method: We train a series of dense Transformer-based language models (0.13B–1.7B parameters), pretraining each on eight open-source reference datasets for up to 1T tokens, and release full training logs, intermediate checkpoints, and downstream evaluation tooling. We propose a "Unified Computational Axis Scaling Framework" to model training dynamics and enable comparison of training methods aligned on a common compute axis. Contribution/Results: Our benchmark establishes a standardized, transparent, and reproducible training baseline and evaluation protocol. Empirical analysis identifies NemoTron-CC HQ as the top-performing dataset across benchmark tasks, yielding an empirically grounded ranking of dataset efficacy. The framework supports rigorous cross-scale comparison of training methodologies, facilitating community-wide progress toward efficient and robust LLM training.
📝 Abstract
We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model scales (0.13B to 1.7B parameters) and token scales (up to 1T) on eight recent open reference datasets. By evaluating the models on standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints enable comparison and study of training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of the open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms the other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.