🤖 AI Summary
This study investigates the relationship between pretraining data and downstream benchmark performance in language models, challenging the assumption that standard evaluation benchmarks are truly out-of-distribution. Through controlled experiments using models ranging from 400M to 3B parameters and diverse pretraining corpora, the authors analyze word-level unigram cross-entropy and word frequency statistics. They find that the lexical overlap between evaluation sets and pretraining data significantly influences zero-shot performance: word-level unigram cross-entropy exhibits a strong negative correlation with model accuracy, and larger pretraining corpora with lexical distributions closer to those of the evaluation sets substantially improve downstream results. These findings suggest that current mainstream benchmarks are only weakly out-of-distribution, thereby questioning conventional interpretations of out-of-distribution generalization in language models.
📝 Abstract
Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Moreover, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, to the point that simple word-overlap statistics predict benchmark performance.
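The paper's central metric can be illustrated with a short sketch: the word-level unigram cross-entropy of an evaluation set under the unigram distribution estimated from a pre-training corpus. The add-alpha smoothing, tokenization by whitespace, and the toy corpora below are illustrative assumptions, not the paper's exact setup; the point is only that evaluation text lexically closer to the training distribution yields a lower cross-entropy.

```python
from collections import Counter
import math

def unigram_cross_entropy(train_tokens, eval_tokens, alpha=1.0):
    """Cross-entropy (bits/word) of eval_tokens under the unigram
    distribution of train_tokens. Add-alpha smoothing over the joint
    vocabulary is an assumption made so unseen words get nonzero mass."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(eval_tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    q = {w: (counts[w] + alpha) / total for w in vocab}
    return -sum(math.log2(q[w]) for w in eval_tokens) / len(eval_tokens)

# Hypothetical toy corpora: one eval set shares vocabulary with the
# training text, the other does not.
train = "the cat sat on the mat".split()
eval_close = "the cat sat".split()
eval_far = "quantum flux capacitor".split()

# Lexically closer eval text -> lower cross-entropy, mirroring the
# inverse relationship with benchmark performance reported above.
print(unigram_cross_entropy(train, eval_close))
print(unigram_cross_entropy(train, eval_far))
```

Under this reading, a benchmark whose word distribution yields low cross-entropy against the pre-training corpus is only weakly out-of-distribution for the resulting model.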