🤖 AI Summary
There is a lack of open, reproducible benchmarks for systematically evaluating language model training methods across scales and datasets. Method: We train a series of dense Transformer-based language models (0.13B–1.7B parameters), pretraining each on eight open-source reference datasets for up to 1T tokens, and release full training logs, intermediate checkpoints, and downstream evaluation tooling. We propose a "Unified Computational Axis Scaling Framework" to model training dynamics and enable comparison of training methods aligned on a common compute axis. Contribution/Results: Our benchmark establishes a standardized, transparent, and reproducible training baseline and evaluation protocol. Empirical analysis identifies NemoTron-CC HQ as the top-performing dataset across benchmark tasks, yielding an empirically grounded ranking of dataset efficacy. The framework supports rigorous cross-scale comparison of training methodologies, facilitating community-wide progress toward efficient and robust LLM training.
📝 Abstract
We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model scales (0.13B to 1.7B parameters) and token scales (up to 1T) on eight recent open reference datasets. By evaluating the models on standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints enable comparison and study of training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of the open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms the other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.