🤖 AI Summary
Evaluation during large language model (LLM) pretraining exhibits significant instability, obscuring true learning dynamics and undermining assessment reliability. To address this, we propose MaP, a unified evaluation framework that, for the first time, integrates parameter-space smoothing (via checkpoint merging) with low-variance capability assessment (based on Pass@k), systematically disentangling and suppressing both parameter noise and evaluation noise. MaP enables smooth modeling in weight space while yielding robust performance estimates in output space, substantially reducing evaluation variance. This yields smoother, more reproducible training curves and improves consistency in model performance ranking across independent training runs. Experiments demonstrate that MaP effectively uncovers LLMs' true convergence behavior and capability evolution, with negligible computational overhead.
📝 Abstract
Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: *Parameter Instability* from training stochasticity and *Evaluation Instability* from noisy measurement protocols. To counteract both sources of noise, we introduce **MaP**, a dual-pronged framework that synergistically integrates checkpoint **M**erging **a**nd the **P**ass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
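The abstract's two ingredients can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes uniform averaging over a window of recent checkpoints (represented here as plain dicts of weight lists) and uses the standard unbiased Pass@k estimator, pass@k = 1 − C(n−c, k)/C(n, k), where n candidate samples contain c correct ones. Function names are illustrative.

```python
import math


def merge_checkpoints(checkpoints):
    """Uniformly average the weights of a window of recent checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    Returns a dict of the same shape with element-wise means.
    """
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(weights))]
        for name, weights in checkpoints[0].items()
    }


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn per problem, c: number judged correct,
    k: budget. If fewer than k incorrect samples exist, Pass@k is 1.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, merging two checkpoints `{"w": [1.0, 3.0]}` and `{"w": [3.0, 5.0]}` gives `{"w": [2.0, 4.0]}`, and with 2 samples of which 1 is correct, `pass_at_k(2, 1, 1)` evaluates to 0.5. Averaging many such per-problem estimates is what gives Pass@k its lower variance compared with single-sample greedy accuracy.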