🤖 AI Summary
Evaluation during large language model (LLM) pretraining exhibits significant instability, obscuring true learning dynamics and undermining assessment reliability. To address this, we propose MaP, a unified evaluation framework that, for the first time, integrates parameter-space smoothing (via checkpoint merging) with low-variance capability assessment (based on Pass@k), systematically disentangling and suppressing both parameter noise and evaluation noise. MaP enables smooth modeling in weight space while yielding robust performance estimates in output space, substantially reducing evaluation variance. This yields smoother, more reproducible training curves and improves consistency in model performance ranking across independent training runs. Experiments demonstrate that MaP effectively uncovers LLMs' true convergence behavior and capability evolution, with negligible computational overhead.
📝 Abstract
Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: *Parameter Instability* from training stochasticity and *Evaluation Instability* from noisy measurement protocols. To counteract both sources of noise, we introduce **MaP**, a dual-pronged framework that synergistically integrates checkpoint **M**erging **a**nd the **P**ass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
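The abstract's two ingredients can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes uniform averaging over a window of recent checkpoints (represented here as plain dicts of weight lists) and uses the standard unbiased Pass@k estimator, pass@k = 1 − C(n−c, k)/C(n, k), where n candidate samples contain c correct ones. Function names are illustrative.

```python
import math


def merge_checkpoints(checkpoints):
    """Uniformly average the weights of a window of recent checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    Returns a dict of the same shape with element-wise means.
    """
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(weights))]
        for name, weights in checkpoints[0].items()
    }


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn per problem, c: number judged correct,
    k: budget. If fewer than k incorrect samples exist, Pass@k is 1.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, merging two checkpoints `{"w": [1.0, 3.0]}` and `{"w": [3.0, 5.0]}` gives `{"w": [2.0, 4.0]}`, and with 2 samples of which 1 is correct, `pass_at_k(2, 1, 1)` evaluates to 0.5. Averaging many such per-problem estimates is what gives Pass@k its lower variance compared with single-sample greedy accuracy.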