Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two key challenges in evaluating foundation large language models (LLMs) during pretraining, before instruction tuning: (1) high volatility and low predictive utility of early-stage evaluation metrics, and (2) inconsistent rankings between foundation models and their instruction-tuned counterparts, which hinders downstream performance prediction. To this end, the authors propose BOSE (Base model Oriented Systematic Evaluation), a framework introducing two novel components: In-Context Light-instruction Prompting (ICLiP) and Blank-ppl, a fill-in-the-blank perplexity metric. BOSE further uses Kendall's tau rank correlation to quantitatively measure evaluation stability and cross-model consistency. By jointly leveraging light-instruction prompting, Blank-ppl, open-ended generation, and multiple-choice tasks, BOSE significantly improves evaluation stability during pretraining and strengthens the alignment between foundation-model rankings and downstream instruction-tuned performance, enabling more reliable assessment for critical studies such as data ablation and scaling-law analysis.

📝 Abstract
This paper identifies two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to draw solid conclusions to guide training, especially for key experiments such as data ablation and scaling-law studies. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance than their corresponding instruct models. This gap makes it hard to assess whether a base model that evaluates better will truly lead to a better instruct model. To address these issues, we propose Base model Oriented Systematic Evaluation (BOSE), a method specifically designed to optimize the evaluation of base models. BOSE introduces two key innovations: In-Context Light-instruction Prompt (ICLiP) for open-ended tasks, and Blank-ppl for multiple-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to apply Kendall's rank correlation to quantitatively measure evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for LLM training.
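The abstract describes Blank-ppl only at a high level: a multiple-choice item is recast as a fill-in-the-blank sequence, each candidate option is inserted into the blank, and the option whose completion has the lowest perplexity is selected. The sketch below illustrates that scoring scheme; it is a minimal illustration, not the paper's implementation. The `score_fn` callback, the `____` blank marker, and the hard-coded `fake_scores` log-probabilities are all hypothetical stand-ins for a real language-model call.

```python
import math

def option_ppl(token_logprobs):
    """Perplexity of an option's tokens when filled into the blank,
    given per-token log-probabilities (hypothetically from an LM)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def blank_ppl_choose(question, options, score_fn):
    """Fill each candidate option into the blank, score its perplexity,
    and pick the option with the lowest blank-ppl.

    score_fn(filled_text, option) -> list of per-token log-probs for the
    option tokens inside the filled sequence (stubbed here)."""
    ppls = {}
    for opt in options:
        filled = question.replace("____", opt)
        ppls[opt] = option_ppl(score_fn(filled, opt))
    return min(ppls, key=ppls.get), ppls

# Stub log-probs standing in for a real model call (illustrative values).
fake_scores = {"Paris": [-0.2, -0.1], "Rome": [-2.0, -1.5]}
pred, ppls = blank_ppl_choose(
    "The capital of France is ____.",
    ["Paris", "Rome"],
    lambda filled, opt: fake_scores[opt],
)
print(pred)  # "Paris": its tokens have the lower blank-ppl
```

In a real setup, `score_fn` would query the base model for the log-probabilities of the option tokens conditioned on the surrounding cloze text; scoring only the blank avoids requiring the early-stage model to follow answer-formatting instructions.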
Problem

Research questions and friction points this paper is trying to address.

Unstable evaluation results during early training stages.
Inconsistency between base and instruct model evaluations.
Need for reliable metrics to guide LLM training effectively.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces In-Context Light-instruction Prompt (ICLiP) for open-ended tasks.
Develops Blank-ppl for multi-choice tasks.
Proposes Kendall's rank correlation for stability measurement.
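The stability and consistency measure named above, Kendall's rank correlation, compares two rankings of the same models (e.g. from adjacent pretraining checkpoints, or base vs. instruct versions) by counting concordant and discordant pairs. A minimal sketch of the tau-a variant, assuming no ties; the checkpoint rankings in the example are invented for illustration:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Returns a value in [-1, 1]; 1 means identical orderings."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant if both rankings order x and y the same way.
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy example: model rankings at two adjacent checkpoints (invented data).
ckpt_early = {"A": 1, "B": 2, "C": 3, "D": 4}
ckpt_late  = {"A": 1, "B": 3, "C": 2, "D": 4}
tau = kendall_tau(ckpt_early, ckpt_late)
print(tau)  # one swapped pair out of six -> (5 - 1) / 6 ≈ 0.667
```

A tau near 1 across consecutive checkpoints indicates a stable evaluation; a high tau between base-model and instruct-model rankings indicates the consistency the paper targets. For production use with ties, `scipy.stats.kendalltau` implements the tau-b correction.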
Authors: Hongzhi Luan, Changxin Tian, Zhaoxin Huan, Xiaolu Zhang, Kunlong Chen, Zhiqiang Zhang, Jun Zhou
Renmin University of China & Ant Group
Large Language Models