NeurIPS 2025 E2LM Competition : Early Training Evaluation of Language Models

📅 2025-06-09
🤖 AI Summary
Problem: Existing LLM evaluation benchmarks often lack discriminative power during the early training stages (up to 200B tokens) of small models (0.5B, 1B, and 3B parameters), which makes it hard to assess how scientific knowledge is acquired. Method: The competition asks participants to design scientific knowledge evaluation tasks tailored to early training, either as novel methodologies or as adaptations of existing benchmarks, using three pre-trained small models and intermediate checkpoints sampled during training. All development can run on free cloud-based GPU platforms. Contribution/Results: Submissions are judged on the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. The aim is to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.

📝 Abstract
Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated based on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
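
The ranking-consistency criterion can be illustrated with a small sketch. The abstract does not say how consistency is scored, so the use of Kendall's tau here is an assumption, and the function name `kendall_tau` and all accuracy values are hypothetical placeholders rather than competition results. The idea is to check whether a task orders the three models the same way at an early checkpoint as it does later in training.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two {model: score} dicts."""
    models = sorted(scores_a)
    if set(models) != set(scores_b):
        raise ValueError("both score sets must cover the same models")
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both score sets order this pair the same way.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical task accuracies (illustrative only, not real measurements).
early = {"0.5B": 0.26, "1B": 0.29, "3B": 0.33}   # early checkpoint
late = {"0.5B": 0.31, "1B": 0.38, "3B": 0.47}    # later in training
reversed_early = {"0.5B": 0.33, "1B": 0.29, "3B": 0.26}

print(kendall_tau(early, late))           # identical ordering
print(kendall_tau(reversed_early, late))  # fully reversed ordering
```

A value near 1 means the early ranking agrees with the later one, and a value near -1 means it is reversed. With only three models the statistic is coarse, so in practice it would likely be computed over several checkpoints or tasks.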
Problem

Research questions and friction points this paper is trying to address.

Design evaluation tasks for early training of small language models
Adapt benchmarks to measure early performance differences effectively
Enable accessible participation with free cloud-based GPU resources
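
One rough way to think about whether a task gives a usable early-training signal is to ask how often its score improves from one checkpoint to the next. The sketch below is only an illustration under that assumption; the source does not define the signal-quality metric, and the name `improving_fraction` and the score sequences are hypothetical.

```python
def improving_fraction(scores):
    """Fraction of consecutive checkpoint steps where the score strictly increases."""
    if len(scores) < 2:
        raise ValueError("need at least two checkpoints")
    steps = list(zip(scores, scores[1:]))
    return sum(later > earlier for earlier, later in steps) / len(steps)


# Hypothetical accuracies over five checkpoints (illustrative only).
near_chance_benchmark = [0.25, 0.24, 0.26, 0.25, 0.25]  # hovers around chance
tailored_task = [0.25, 0.28, 0.31, 0.30, 0.36]          # mostly improving

print(improving_fraction(near_chance_benchmark))
print(improving_fraction(tailored_task))
```

A benchmark that stays near chance yields a low fraction, while a task that tracks training progress yields a higher one. A fuller measure would also account for noise across seeds and for the separation between model sizes.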
Innovation

Methods, ideas, or system contributions that make the work stand out.

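Design scientific knowledge evaluation tasks tailored to early training
Provide three pre-trained small models (0.5B, 1B, and 3B parameters) with intermediate checkpoints up to 200B tokens
Run all experiments on free cloud-based GPU platforms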